Handling text files in Groovy script of CPI (SAP Cloud Platform Integration).
Introduction:
Handling huge text files (either CSV or fixed length) is a challenge in CPI (SAP Cloud Platform Integration).
Usually, before converting them to the XML required for mapping, we read them in Groovy scripts and also manipulate the data. Most often this is done by converting the payload to a String, which is very memory intensive.
In this blog post, I will show alternative ways to handle such files: not only how to read large files, but also how to manipulate them.
I hope you enjoy reading it.
Main Section:
In CPI (SAP Cloud Platform Integration) we sometimes come across scenarios where we need to process an input CSV or some other character-delimited text file.
These files are often much larger than the data we receive in XML or JSON format.
This data, which can be delimited by “,”, tab or “|”, or be of fixed length, creates additional complexity: it first has to be read, sorted and converted to XML (for mapping to some target structure) before it can finally be processed. Sometimes we also have to check the number of fields beforehand to validate whether a line in the file is worth processing, so that unnecessary data does not flow any further.
Like: File -> input.csv
A,12234,NO,C,20190711,……
A,26579,NO,D,20190701,…….
……………………………………………..
……………………………………………..
Say we have to process all lines of the above file where the fourth field, the flag, is set to ‘D’ (debit indicator).
So, in the above example, after reading the file we should keep only the lines that have ‘D’ as the fourth field; line 1 above should not be processed further.
Below we will see how to handle text and CSV files, especially huge ones, and how to process each line without converting the whole payload to a String, which consumes more memory.
*. Reading large files :
We normally start our scripts by converting the input payload to a String object.
String content = message.getBody(String) // this line is used in most scripts.
But in the case of large files, the above line converts the whole payload to a String and holds it in memory, which is not good practice. Any further change that creates or replaces String objects takes even more space. This can also end in an OutOfMemoryError.
The better way is to handle the payload as a stream. There are two classes that can handle stream data.
a. java.io.Reader -> handles data as a character or text stream
b. java.io.InputStream -> handles data as a raw or binary stream.
Depending on the level of control you need over the data, or on the business requirement, you can use either one. The Reader class is usually easier to use, because it delivers decoded characters (Java chars, which are UTF-16 code units) rather than the raw bytes of the payload (often UTF-8 encoded).
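To make the difference between the two classes concrete, here is a minimal sketch that wraps a byte stream in a Reader with an explicit charset. The sample data and the variable names are assumptions for illustration only; in a CPI script the byte stream would come from the message body rather than from a byte array.

```groovy
import java.nio.charset.StandardCharsets

// Sample payload (assumption): "Zürich,1" followed by a newline, encoded as UTF-8.
byte[] bytes = "Z\u00fcrich,1\n".getBytes(StandardCharsets.UTF_8)

// InputStream view: raw bytes, the umlaut occupies two of them.
InputStream raw = new ByteArrayInputStream(bytes)

// Reader view: the same bytes decoded to characters with an explicit charset.
BufferedReader text = new BufferedReader(new InputStreamReader(raw, StandardCharsets.UTF_8))
String firstLine = text.readLine()
text.close()

println "bytes: ${bytes.length}, characters in first line: ${firstLine.length()}"
```

Naming the charset explicitly avoids depending on the platform default when the payload contains non-ASCII characters.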
Reading Data in CPI groovy script via java.io.Reader:
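The listing for this step is not included here, so the following is only a minimal sketch of reading a payload line by line through a Reader. The helper name countNonEmptyLines is hypothetical, and a StringReader stands in for the reader that a CPI script would obtain with message.getBody(java.io.Reader).

```groovy
import java.io.BufferedReader
import java.io.Reader
import java.io.StringReader

// Hypothetical helper: counts non-empty lines without building one big String.
long countNonEmptyLines(Reader reader) {
    long count = 0
    BufferedReader br = new BufferedReader(reader)
    try {
        String line
        // Only the current line is held in memory at any time.
        while ((line = br.readLine()) != null) {
            if (!line.trim().isEmpty()) {
                count++
            }
        }
    } finally {
        br.close()
    }
    return count
}

// Stand-in for: Reader reader = message.getBody(java.io.Reader)
Reader reader = new StringReader("A,12234,NO,C,20190711\nA,26579,NO,D,20190701\n")
println countNonEmptyLines(reader)
```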
Reading Data at each field or word level, for each line:
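As a sketch of field-level processing, the example below applies the rule from the input.csv example above: keep only lines whose fourth field is ‘D’. The helper name keepLinesWithFlag and the comma delimiter are assumptions. The kept lines go to a Writer, so only the filtered output is collected, not the whole input.

```groovy
import java.io.BufferedReader
import java.io.Reader
import java.io.StringReader
import java.io.StringWriter
import java.io.Writer

// Hypothetical helper: copies only the lines whose fourth field equals the given flag.
void keepLinesWithFlag(Reader input, Writer output, String flag) {
    BufferedReader br = new BufferedReader(input)
    try {
        String line
        while ((line = br.readLine()) != null) {
            // The limit -1 keeps trailing empty fields, so the field count stays accurate.
            String[] fields = line.split(",", -1)
            // Lines with fewer than four fields are dropped as not worth processing.
            if (fields.length >= 4 && fields[3] == flag) {
                output.write(line)
                output.write("\n")
            }
        }
    } finally {
        br.close()
    }
}

// Stand-in for: Reader input = message.getBody(java.io.Reader)
Reader input = new StringReader("A,12234,NO,C,20190711\nA,26579,NO,D,20190701\n")
StringWriter kept = new StringWriter()
keepLinesWithFlag(input, kept, "D")
println kept.toString()
```

Note that a plain split does not handle quoted fields containing the delimiter; files like that would need a proper CSV parser.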
*. Not a good way to do replace on data in CPI Groovy:
The String way of doing it –
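The original listing is not included here. A sketch of what the String approach typically looks like is shown below, with a small literal standing in for the result of message.getBody(String); the pipe-to-comma replacement is only an assumed example.

```groovy
// Stand-in for: String content = message.getBody(String)
String content = "A|12234|NO\nA|26579|NO\n"

// replace() builds a second full-size String while the first one is still referenced,
// so a large payload briefly occupies roughly twice its size in memory.
String replaced = content.replace("|", ",")
println replaced
```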
The better approach of doing a replace while reading it as Stream:
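A minimal sketch of the streaming variant follows: each line is read, changed, and written out before the next one is read. The helper name replaceInStream is hypothetical, and the StringReader and StringWriter stand in for the message body and for whatever target the script writes to.

```groovy
import java.io.BufferedReader
import java.io.Reader
import java.io.StringReader
import java.io.StringWriter
import java.io.Writer

// Hypothetical helper: applies a literal replacement line by line.
void replaceInStream(Reader input, Writer output, String target, String replacement) {
    BufferedReader br = new BufferedReader(input)
    try {
        String line
        while ((line = br.readLine()) != null) {
            // Only one line and its replaced copy exist at a time.
            output.write(line.replace(target, replacement))
            // Line endings are normalised to "\n", and the last line always gets one.
            output.write("\n")
        }
    } finally {
        br.close()
    }
}

// Stand-in for: Reader input = message.getBody(java.io.Reader)
Reader input = new StringReader("A|12234|NO\nA|26579|NO")
StringWriter result = new StringWriter()
replaceInStream(input, result, "|", ",")
println result.toString()
```

Because the replacement works per line, it cannot match text that spans a line break.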
*. Reading payload as a java.io.InputStream stream object:
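As a sketch of the binary route, the example below copies a stream in fixed-size chunks instead of loading it whole. The helper name copyInChunks and the buffer size are assumptions, and a ByteArrayInputStream stands in for the stream that a CPI script would obtain with message.getBody(java.io.InputStream).

```groovy
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.io.InputStream
import java.io.OutputStream

// Hypothetical helper: copies a stream chunk by chunk and returns the number of bytes copied.
long copyInChunks(InputStream source, OutputStream target, int bufferSize) {
    byte[] buffer = new byte[bufferSize]
    long total = 0
    int read
    // read() returns -1 at end of stream; it may fill only part of the buffer.
    while ((read = source.read(buffer)) != -1) {
        target.write(buffer, 0, read)
        total += read
    }
    return total
}

// Stand-in for: InputStream source = message.getBody(java.io.InputStream)
InputStream source = new ByteArrayInputStream("A,12234,NO,C,20190711\n".getBytes("UTF-8"))
ByteArrayOutputStream target = new ByteArrayOutputStream()
long copied = copyInChunks(source, target, 8192)
println "copied ${copied} bytes"
```

Working on bytes is the right choice for content such as zip files or PDFs, where decoding to characters would corrupt the data.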
Conclusion:
This blog post is written to ease the pain of developers: while building iFlows, we come across many cases where we need to handle large text files in CSV or another delimited format, which requires reading the entire file and sometimes parsing and working on the data of each line.
In all those cases, this blog post can help you quickly build the Groovy scripts needed in CPI (SAP Cloud Platform Integration) iFlows to handle these types of data.
It speeds up such developments by providing an approach and reusable code for achieving the outcome.
I look forward to your inputs and suggestions.
Great one!! Keep Blogging Subhojit 🙂
Thanks Arindam.
Hi Subhojit,
I am reading a zip file through a Groovy script. It works fine as long as the zip file is smaller than 4MB. If the zip file is larger than 4MB, it gives the below error.
When I tried to print body.available() in the log, it shows 0 for files larger than 4MB.
I used the message.getBodySize() method instead of body.available(), but it is still not working.
The maximum zip file size that we expect in real time would be more than 70MB.
Below is the program that I use to read through the zip file.
Can you please guide me where I am wrong?
Regards,
Anand...
Hi, can you try to write your code like below? (Remember, you have to adapt the code below to your needs, but the overall concept remains the same.)
======================================================================
import com.sap.gateway.ip.core.customdev.util.Message
import java.nio.charset.StandardCharsets
import java.util.zip.ZipEntry
import java.util.zip.ZipInputStream

def Message processData(Message message) {
    def messageLog = messageLogFactory.getMessageLog(message)
    InputStream is = message.getBody(InputStream.class)
    ByteArrayOutputStream out = new ByteArrayOutputStream()
    int n
    boolean canRead = false
    def myData = ''
    // Skip everything before the first 'P' (byte 80), where the zip signature "PK" starts.
    while ((n = is.read()) > -1) {
        if (n == 80 && !canRead) {
            canRead = true
        }
        if (!canRead) {
            continue
        }
        out.write(n)
    }
    // Re-open the cleaned bytes as a zip stream.
    InputStream is2 = new ByteArrayInputStream(out.toByteArray())
    ZipInputStream zipStream = new ZipInputStream(is2)
    ZipEntry entry = zipStream.getNextEntry()
    byte[] buf = new byte[1024]
    while (entry != null) {
        if (entry.getName().contains("PDF")) {
            // Read the current entry in 1 KB chunks.
            ByteArrayOutputStream baos = new ByteArrayOutputStream()
            int m
            while ((m = zipStream.read(buf, 0, 1024)) > -1) {
                baos.write(buf, 0, m)
            }
            myData = new String(baos.toByteArray(), StandardCharsets.UTF_8).replace("\"UTF-8\"\n", "")
        }
        zipStream.closeEntry()
        entry = zipStream.getNextEntry()
    }
    zipStream.close()
    messageLog.setStringProperty("Logging#5", "Printing Input Payload As Attachment")
    messageLog.addAttachmentAsString("#ZIP CONTENT- payment_gl(PDF)", myData, "text/plain")
    // The content of the last matching entry becomes the new message body.
    message.setBody(myData)
    return message
}
=======================================================================
If it still gives the same error, please open a ticket with the CPI team.
Hi mate,
Thanks for your reply.
My bad, I forgot to mention that I need to encode the PDF content. Actually, I had another set of code that is able to read through a zip file of more than 10MB, but I could not encode. It was giving an error like "Stream close". That's why I changed the code to read it into the FileOutputStream.
When I added the encoding part to your code, it gives the same "Stream close" error. As I am inserting this PDF into SuccessFactors, I need to do Base64 encoding. Please see the actual code.
FYI, your code is able to read through all the files inside the zip file. If you could help me do the encoding with your code, that would be great.
Hi S,
I am able to do the encoding.
Thanks
Anand...
Hi Subhojit,
Great blog!
I have a requirement to read 3 very large CSV files, simply combine them, and send them to the receiver.
While I can use the input stream to read the files and use memory efficiently, the combining part using aggregation will need the files to be converted to XML, since aggregation in CPI works only with XML.
And since I will be doing an aggregation, this large data will still be stored in the data store, won't it?
Do you have any ideas/work around to manage that?
Thanks,
Shubham
Hi Subhojit,
We have a flat file where the headers contain special characters. We want to apply a replace only to the header line. How do we achieve that using Groovy?
Thanks,
Hemant
Thanks for your blog. I tried your method and it doesn't work.
My interface extracts an email attachment via the sender mail adapter. In the trace I can see that the attachment is extracted by the mail adapter and saved into the body.
However, when I use the Groovy script to read the attachment via getBody, nothing gets read. I verified this with body.length(), which is 0.
Here is the simple code I use to get the body:
String body = message.getBody(java.lang.String)
body.length() is 0.
I also used your array method, and the array has size() 0.
Is that a bug in CPI, as getBody is a local CPI method?
Thanks Jonathan.
Hi Subhojit,
I tried implementing your logic (using java.io.Reader instead of String),
but I am facing an issue.
Below is my input payload
Below is the script :
What could be the possible solution for the below error?