Groovy to extract text from PDF in CPI
Introduction
This blog shows how to solve a custom requirement: extracting text from a PDF with the help of a Groovy script.
Note: This Groovy script will not work well on PDFs with formatted content (images, bullet points, workflows).
Current Scenario: At the time of writing, I could not find any blog on extracting text from a PDF in SAP CPI.
Why are we doing this?
It gives us the flexibility to work with PDF files. Most of the time, the content coming to SAP Cloud Platform Integration is XML, JSON, CSV, or EDI. When such content arrives wrapped in a PDF, this Groovy script can extract it as plain text, and the rest of the transformation can be done as per the scenario.
PROCEDURE:
STEP 1: Download the pdfbox and fontbox JAR files and upload them to your iFlow.
- Download the pdfbox JAR file from the following link.
- Download the fontbox JAR file from the following link.
- Upload both JAR files in the Resources tab of your iFlow.
STEP 2: Take a sample payload for PDF conversion.
This is the sample CSV payload used for conversion.
Material_Name,Material_ID,Material_Number
Iron,KAU145,240
Copper,KAU146,800
Zinc,KAU222,180
Cobalt,KAU338,546
Pdf screenshot of the above CSV file
STEP 3: Use Groovy Script in your iFlow to extract text from PDF.
iFlow Explanation:
- We use an HTTP adapter to trigger the integration flow with the PDF file.
- Then we use a "Groovy Script" step to extract the content of the PDF.
- After that, we use the "CSV to XML Converter" to convert the extracted CSV text to XML.
Postman Configuration:
Configure Postman as shown in the image below.
Groovy Script:
import com.sap.gateway.ip.core.customdev.util.Message
import org.apache.pdfbox.pdmodel.PDDocument
// PDFBox 1.8.x package; in PDFBox 2.x the class is org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.util.PDFTextStripper

// Extracts the plain text of a PDF supplied as a stream.
// Exceptions propagate so that the iFlow fails visibly instead of returning an empty body.
static String readFromPDF(InputStream input) {
    PDDocument pd = null
    try {
        pd = PDDocument.load(input)
        PDFTextStripper stripper = new PDFTextStripper()
        stripper.setStartPage(1)   // first page to extract (1-based)
        // stripper.setEndPage(1)  // last page to extract; defaults to the last page
        return stripper.getText(pd)
    } finally {
        // Close the document even if extraction fails
        if (pd != null) {
            pd.close()
        }
    }
}

def Message processData(Message message) {
    // Request the body as a stream, since the PDF arrives as binary content
    InputStream pdfStream = message.getBody(InputStream)
    message.setBody(readFromPDF(pdfStream))
    return message
}
Groovy Script Explanation:
- First, we fetch the message body as an InputStream.
- Then we call the function readFromPDF and pass it that stream.
- If the PDF has multiple pages, or you want the content of certain pages only, you can use the stripper.setStartPage(n) and stripper.setEndPage(n) methods. Page numbers are 1-based.
- Finally, we use the stripper.getText(pd) method to read the content.
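To make the page-range point above concrete, here is a minimal sketch of a helper that checks a requested range against the number of pages before it is handed to the stripper. The function name resolvePageRange and its parameters are hypothetical and not part of PDFBox or the original script. The PDFBox calls are shown only as comments, because they need the JAR files from STEP 1.

```groovy
// Hypothetical helper (not part of PDFBox): validates a 1-based, inclusive page range.
// endPage may be null, meaning "up to the last page".
List<Integer> resolvePageRange(int startPage, Integer endPage, int pageCount) {
    if (pageCount < 1) {
        throw new IllegalArgumentException("PDF has no pages")
    }
    if (startPage < 1 || startPage > pageCount) {
        throw new IllegalArgumentException("Start page ${startPage} is outside 1..${pageCount}")
    }
    // Clamp the end page to the last page of the document
    int lastPage = (endPage == null) ? pageCount : Math.min(endPage, pageCount)
    if (lastPage < startPage) {
        throw new IllegalArgumentException("End page ${lastPage} is before start page ${startPage}")
    }
    return [startPage, lastPage]
}

// Possible use inside readFromPDF, assuming pd is the loaded PDDocument:
//   def (first, last) = resolvePageRange(2, 3, pd.getNumberOfPages())
//   stripper.setStartPage(first)
//   stripper.setEndPage(last)

println resolvePageRange(1, null, 5)   // prints [1, 5]
println resolvePageRange(2, 10, 3)     // prints [2, 3]
```

Failing early with a clear message is a design choice for this sketch; it makes a wrong page number easier to spot in the message processing log than an empty body would be.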
Output:
<?xml version='1.0' encoding='UTF-8'?>
<Record>
  <root>
    <Material_Name>Iron</Material_Name>
    <Material_ID>KAU145</Material_ID>
    <Material_Number>240</Material_Number>
  </root>
  <root>
    <Material_Name>Copper</Material_Name>
    <Material_ID>KAU146</Material_ID>
    <Material_Number>800</Material_Number>
  </root>
  <root>
    <Material_Name>Zinc</Material_Name>
    <Material_ID>KAU222</Material_ID>
    <Material_Number>180</Material_Number>
  </root>
  <root>
    <Material_Name>Cobalt</Material_Name>
    <Material_ID>KAU338</Material_ID>
    <Material_Number>546</Material_Number>
  </root>
</Record>
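In the iFlow, the standard "CSV to XML Converter" step produces this structure. For testing the extracted text outside CPI, roughly the same mapping can be sketched in plain Groovy. This is only an illustration under stated assumptions: the function name csvToXml is hypothetical, the header row is assumed to contain valid XML element names, fields are assumed to contain no quoted commas, and the XML declaration is left out.

```groovy
import groovy.xml.MarkupBuilder

// Hypothetical helper: turns header-first CSV text into <Record><root>...</root></Record>.
String csvToXml(String csvText) {
    // Drop blank lines and surrounding whitespace that text extraction may leave behind
    List<String> lines = (csvText ?: '').readLines().collect { it.trim() }.findAll { it }
    if (lines.isEmpty()) {
        return ''
    }
    List<String> headers = lines.head().split(',').collect { it.trim() }
    StringWriter writer = new StringWriter()
    MarkupBuilder xml = new MarkupBuilder(writer)
    xml.Record {
        lines.tail().each { String line ->
            // Limit -1 keeps trailing empty fields
            List<String> fields = line.split(',', -1).collect { it.trim() }
            xml.root {
                headers.eachWithIndex { String name, int i ->
                    // Element name comes from the header row; missing fields become empty
                    xml."${name}"(i < fields.size() ? fields[i] : '')
                }
            }
        }
    }
    return writer.toString()
}

String sample = '''Material_Name,Material_ID,Material_Number
Iron,KAU145,240
Copper,KAU146,800
'''
println csvToXml(sample)
```

In a real iFlow the converter step is the simpler choice; a script like this is mainly useful for checking locally that the extracted text has the shape the converter expects.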
More Sample PDFs and their Output:
850 EDI pdf to 850 EDI file
In the above image, you can see that I have used an 850 EDI in PDF format. All of its text can be extracted with this Groovy script and then used in CPI as per your requirement.
You can use the same EDI payload from this link.
Sample PO to the relevant text
In the above image, we are able to extract the text of the PDF, but we will not be able to convert this text to an EDI file: the layout of each PO may vary, so writing a common script will be difficult.
So this Groovy script is not recommended for formatted PDFs like the one shown in the above image.
You can visit this site to download the sample PO.
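One way to guard against formatted PDFs is to check whether the extracted text actually looks like CSV before passing it on. The sketch below is an assumption of mine and not part of the original iFlow: the function name looksLikeCsv is hypothetical, and the rule it applies (at least two lines, at least two columns, the same column count on every line) is a simple heuristic that will not handle quoted fields.

```groovy
import java.util.regex.Pattern

// Hypothetical heuristic: true when every non-blank line has the same number of
// delimiter-separated fields (at least two lines and at least two columns).
boolean looksLikeCsv(String text, String delimiter = ',') {
    List<String> lines = (text ?: '').readLines().collect { it.trim() }.findAll { it }
    if (lines.size() < 2) {
        return false
    }
    String splitter = Pattern.quote(delimiter)
    int expected = lines[0].split(splitter, -1).length
    if (expected < 2) {
        return false
    }
    return lines.every { it.split(splitter, -1).length == expected }
}

println looksLikeCsv('Material_Name,Material_ID,Material_Number\nIron,KAU145,240')  // prints true
println looksLikeCsv('PURCHASE ORDER\nShip To: ACME, Inc.\nTotal 100')              // prints false
```

A script step could call such a check after extraction and raise an error, so that a formatted PO fails early instead of producing a misleading XML.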
Conclusion:
To conclude, this blog shows how to extract the text content of a PDF using a Groovy script.
Check out the link for more helpful information about Cloud Platform Integration (CPI).
If you have any queries, please feel free to ask in the comments. I would also request everyone to share feedback and like this blog post if you find it helpful.
Thanks & Regards,
Samarjit Singha
Thanks for amazing write-up
Thanks, brother, I needed this in my project.
Aoa, Dear, I need to talk to you regarding this article. I have tried it, but without success.
Thanks a lot, Samarjit, for this blog.
If I have a customer's PO in PDF format, is it possible for me to process that PDF PO and create ORDERS IDoc or EDI (850)?
Hi Francis
I have tested it with 850 PO and it's working fine.
Thanks,
Samarjit
Hi Samarjit,
What I meant is that the customer's PO is in PDF, not necessarily an 850 PO in PDF.
Thanks,
Francis
Hi Samarjit,
Like this sample image below of a PDF PO.
Thanks,
Francis
Hi Francis,
This scenario will not work for your problem statement, because here text can be extracted only from a PDF containing plain text characters, excluding bullet points, images, tables, etc.
I have one query regarding your problem statement: how are you planning to convert the extracted text to an EDI format? As far as I know, there is no capability in CPI to convert an actual Purchase Order (as in the above sample image) to EDI or IDoc.
If you have an actual EDI or IDOC in PDF format then this scenario will work.
Thanks,
Samarjit
Hi Samarjit,
Up to PI 7.11, there was the SAP Conversion Agent.
Thanks,
Francis
Hello Samarjit,
Could you please share the pdf files of the tested 850 PO and the Blog.pdf in above your screen?
Thanks!
Lucy
Hi Lucy,
I have updated the blog. Please go through it. If you have some more queries feel free to ask.
Regards,
Samarjit
Hello Samarjit,
Thanks for your update! Could you please enclose the PDF file as an attachment, so I can download it and use it in Postman to simulate it in my iFlow?
I tried to simulate the PDF reading as in your blog, but have not succeeded yet. Please see the iFlow errors below.
Best Regards,
Lucy
Hi Lucy,
Thanks for pointing out the errors. There was an issue with the JAR files of pdfbox and fontbox, so I have updated the blog with new links. Please download the new JAR files and upload them to the Resources tab of your iFlow.
You can download the pdf files from this link.
Thanks,
Samarjit
Thanks a lot Samarjit! It's working fine now.
Hi Samarjit,
Thank you for this blog. I'm working on a custom scenario and this blog helps a lot.
Thanks,
Raunak
Samarjit Singha, for better understanding, please add to the blog a screenshot of the PDF you have used, so that everyone can see how it looks.
Thanks
I agree with you. I want to know what the PDF file looks like, so I can simulate the scenario and understand how it works.
Hi Eurico,
I have updated the blog with the same.
Regards,
Samarjit
Hi Samarjit
Thanks for the blog post. Just as a general note: If I would get such a requirement, I would first of all challenge the requester or owner of the sender system and ask why in the world they would send a CSV file in PDF format and not simply as a CSV text string... If there's really a hard requirement for this, sure your script and the library would come in handy. However I think the more interesting use case would be processing of a formatted document, which is much more difficult as you point out.
There is a creative solution using SAP RPA here (without Cloud Integration though): https://blogs.sap.com/2021/09/07/translating-pdf-documents-with-sap-intelligent-robotic-process-automation-and-the-document-translation-service/
Philippe
Hi Samarjit,
I am facing the following error. Could you guide me in resolving it?
My .XSD file
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Material">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Items">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:string" name="Material_Name"/>
              <xs:element type="xs:string" name="Material_ID"/>
              <xs:element type="xs:integer" name="Material_Number"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>