UPLOADING A FILE FROM PO TO AWS s3 WITHOUT SIZE RESTRICTIONS


 

One of the many challenges that we face in on-premise to cloud migration is the difference in approach to integration. While solutions on the cloud have great flexibility in making messages as granular as possible, on-premise solutions are bound by restricting factors like bandwidth, memory and message size. As a basic premise of SAP PO development, our aim as developers is to optimize the many factors of an integration requirement, message size being one of the most important.

This document provides one of the many solutions we can adopt to transfer a file from PO to an AWS s3 bucket without having to worry about the size of the file. AWS offers a set of APIs for uploading files into an s3 bucket and, unlike PO, it places no practical restriction on the size of the file received. You can find the AWS API details here - https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html

Amazon gives us three options to upload files into s3 via APIs:

Single Chunk Upload - https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-header-based-auth.html

Multi Chunk Upload - https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html

Multi Part Upload - https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html

We had a file size restriction of ~900 KB, which wasn't very promising. For smaller files, Amazon's Single Chunk Upload is the simplest option; it is covered extensively in Rajesh PS's blog -

https://blogs.sap.com/2019/05/31/integrating-amazon-simple-storage-service-amazon-s3-and-sap-ecc-v6....

Multi-chunk upload was not feasible for PO since it requires an HTTP connection to remain open until all the chunks have been transferred. That leaves us with the Multipart Upload option. I will cover how I transferred a huge file using multipart upload. The basic design is as follows:


Pic 1: AWS Multipart Upload




AWS MULTIPART UPLOAD APIs



Multipart Upload works by first calling the Initiate Multipart Upload API, for which AWS returns an UploadId. The smaller files are then sent one by one with the UploadId and a part number. AWS allows the parts to be uploaded in parallel, but that requires knowing the sequence numbers beforehand, which is not possible in this design. AWS returns an ETag for each part uploaded. After all the parts have been uploaded, you call the Complete Multipart Upload API with all the ETags and part numbers; this enables AWS to 'stitch' the multiple small files back into one large file based on the details sent. I had to develop the AWS signature calculation from scratch: https://github.com/pnathan01/AWSSAPMultipartUploadLibrary. I also took inspiration from Rajesh's blog for the single chunk upload (link in the last section).
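The signature itself follows the standard AWS Signature Version 4 scheme. As an illustration of what the library handles, here is a minimal Java sketch of just the signing-key derivation; the canonical request, string to sign and final signature are left to the library, and the class and method names are illustrative.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class SigV4KeySketch {

    // HMAC-SHA256 of a string with the given key
    static byte[] hmacSha256(byte[] key, String data) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(data.getBytes("UTF-8"));
    }

    // Derives the SigV4 signing key from the secret key, the request date (yyyyMMdd),
    // the region (e.g. "us-east-1") and the service name ("s3").
    static byte[] signingKey(String secretKey, String date, String region, String service) throws Exception {
        byte[] kDate    = hmacSha256(("AWS4" + secretKey).getBytes("UTF-8"), date);
        byte[] kRegion  = hmacSha256(kDate, region);
        byte[] kService = hmacSha256(kRegion, service);
        return hmacSha256(kService, "aws4_request");
    }
}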


 

The following is a diagrammatic representation of each of the three types of calls made to the s3 API.


Pic 2: Initiate Multipart Upload
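As a rough illustration of the Initiate Multipart Upload call, here is a minimal Java sketch of what the request boils down to; the bucket, key and header values are placeholders, and the SigV4 Authorization header is assumed to come from the signature library mentioned above.

import java.net.HttpURLConnection;
import java.net.URL;

public class InitiateMpuSketch {

    static int initiate(String bucket, String key, String authHeader, String amzDate, String emptyBodyHash) throws Exception {
        // POST /<key>?uploads against the bucket endpoint starts the multipart upload
        URL url = new URL("https://" + bucket + ".s3.amazonaws.com/" + key + "?uploads");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Authorization", authHeader);           // SigV4 header from the library
        con.setRequestProperty("x-amz-date", amzDate);                 // e.g. 20200101T000000Z
        con.setRequestProperty("x-amz-content-sha256", emptyBodyHash); // SHA-256 of the empty body
        con.setDoOutput(true);
        con.getOutputStream().close();                                 // the request body is empty
        // The response is an InitiateMultipartUploadResult document whose <UploadId>
        // must be carried by every Upload Part and Complete Multipart Upload call.
        return con.getResponseCode();
    }
}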



Pic 3: Upload Part
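Similarly, a minimal sketch of a single Upload Part call: partNumber and uploadId travel on the query string, and the ETag to remember comes back as a response header. All names and values here are placeholders.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class UploadPartSketch {

    static String uploadPart(String bucket, String key, String uploadId, int partNumber,
                             byte[] partBytes, String authHeader, String amzDate, String partHash) throws Exception {
        // PUT /<key>?partNumber=<n>&uploadId=<id> uploads one chunk of the split file
        URL url = new URL("https://" + bucket + ".s3.amazonaws.com/" + key
                + "?partNumber=" + partNumber + "&uploadId=" + uploadId);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("PUT");
        con.setRequestProperty("Authorization", authHeader);
        con.setRequestProperty("x-amz-date", amzDate);
        con.setRequestProperty("x-amz-content-sha256", partHash);  // SHA-256 of partBytes
        con.setDoOutput(true);
        OutputStream os = con.getOutputStream();
        os.write(partBytes);                                       // the >= 5 MB chunk produced by the file split
        os.close();
        return con.getHeaderField("ETag");                         // stored in the metadata file for the final call
    }
}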



Pic 4: Complete Multipart Upload
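Finally, a minimal sketch of how the Complete Multipart Upload body can be assembled from the part numbers and ETags collected in the metadata file; it is then sent as the body of a POST to /<key>?uploadId=<UploadId>. The method and map names are illustrative.

import java.util.Map;

public class CompleteMpuBodySketch {

    // partEtags maps part number -> ETag, in ascending part-number order
    static String buildBody(Map<Integer, String> partEtags) {
        StringBuilder xml = new StringBuilder("<CompleteMultipartUpload>");
        for (Map.Entry<Integer, String> entry : partEtags.entrySet()) {
            xml.append("<Part>")
               .append("<PartNumber>").append(entry.getKey()).append("</PartNumber>")
               .append("<ETag>").append(entry.getValue()).append("</ETag>")
               .append("</Part>");
        }
        xml.append("</CompleteMultipartUpload>");
        return xml.toString();  // AWS uses this list to stitch the parts back into one object
    }
}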


 

IMPLEMENTATION IN PO

As you can see, all of the above communications are Async-Sync bridges. I have created 4 ICOs – 1 for the request, with routing to the different APIs, and 3 for the responses. All 3 response ICOs write into the metadata file.

Please note that I have not accounted for any data conversion, format checks or mapping of the data in the file. Technically these are pass-through scenarios (except Complete Multipart Upload), but we still have to use mappings to calculate the AWS API signatures.

The following configuration will need to be done beforehand:

  1. Implement certificates, if any

  2. Enable large message handling. This is because AWS requires every part of a multipart upload, except the last one, to be 5 MB or larger.


Create ESR Objects

We have 2 different software components for source and target systems. So this scenario is from Software Component 1 to Software Component 2. Also, we will be using a common structure to write the metadata content like UploadIDs and ETags into the metadata file.

  1. Data Types

    1. Request Data Types in both software components.

    2. Response Data Type in Software Component 1

    3. Request Data Type in Software Component 2. The CompleteMultipartUpload Data Type will not be a pass-through mapping because there is actual data that needs to be sent.

    4. Response Data Type in Software Component 2. The InitiateMultipartUploadResult data type is used to capture the response of both Initiate MPU and Upload Part because the UploadID and ETags are part of the header.

  2. Message Types

    1. Request Message Type in Software Component 1

    2. Response Message Type in Software Component 1

    3. Request Message Type in Software Component 2

    4. Response Message Type in Software Component 2 - I've changed the namespace for these structures because I didn't want to develop code or an adapter module to handle this separately. Just like the data type, this message type is common for both the Initiate MPU and Upload Part API response structures.

  3. Service Interfaces

    1. One outbound service for all requests from Software Component 1

    2. One inbound service for all responses from Software Component 2

    3. 3 synchronous inbound services to send the request messages

    4. 3 outbound asynchronous services to receive the responses – You will notice that we have separate services for Initiate MPU and Upload Part even though the structure is the same. This is because the mappings for them are different.

  4. Mapping Programs – I've added a parameter called action because I reuse the same library code for both single chunk and multipart upload; the distinction is needed to calculate the AWS signature for the payload. The mappings themselves are pretty straightforward except for the signature calculation, which is handled entirely in the library (a rough sketch of such a parameterized mapping is included after this list). Code for AWS Signature Calculation: https://github.com/pnathan01/AWSSAPMultipartUploadLibrary

    1. Request Mapping – Initiate Multipart Upload

    2. Request Mapping – Upload Part

    3. Request Mapping – Complete Multipart Upload – conversion and formatting

    4. Response Mapping – Initiate Multipart Upload – Java code to handle an empty message received

    5. Response Mapping – Upload Part – Java code to handle an empty message received

    6. Response Mapping – Complete Multipart Upload – the response is namespace-agnostic, which causes issues in PI



  5. Operation Mappings

    1. Request OM – Initiate Multipart Upload

    2. Request OM – Upload Part

    3. Request OM – Complete Multipart Upload

    4. Response OM – Initiate Multipart Upload

    5. Response OM – Upload Part

    6. Response OM – Complete Multipart Upload
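As mentioned under Mapping Programs, the mappings are parameterized. Below is a rough sketch of such a PO Java mapping, assuming the standard com.sap.aii.mapping.api classes; the action parameter and the empty-response handling mirror what is described above, while the class name and the fallback XML are illustrative.

import java.io.InputStream;
import java.io.OutputStream;
import com.sap.aii.mapping.api.AbstractTransformation;
import com.sap.aii.mapping.api.StreamTransformationException;
import com.sap.aii.mapping.api.TransformationInput;
import com.sap.aii.mapping.api.TransformationOutput;

public class AwsMpuMappingSketch extends AbstractTransformation {

    @Override
    public void transform(TransformationInput in, TransformationOutput out) throws StreamTransformationException {
        try {
            // "action" distinguishes single chunk from multipart so the same signature library can be reused
            String action = in.getInputParameters().getString("action");
            getTrace().addInfo("Signing payload for action " + action);

            InputStream payload = in.getInputPayload().getInputStream();
            OutputStream result = out.getOutputPayload().getOutputStream();

            if (payload.available() == 0) {
                // Upload Part returns an empty body (the ETag travels in the HTTP header),
                // so emit a minimal document instead of failing the mapping
                result.write("<?xml version=\"1.0\"?><InitiateMultipartUploadResult/>".getBytes("UTF-8"));
                return;
            }

            // ... pass the payload through and let the library compute the SigV4 signature for 'action' ...
        } catch (Exception e) {
            throw new StreamTransformationException(e.getMessage(), e);
        }
    }
}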


Create ID Objects

We will create the following ID objects:

  • 1 ICO for splitting the files

  • 1 ICO for sending out the request part of the Async-Sync bridge

  • 3 ICOs for receiving the response



  1. File Split Scenario

    1. Sender Channel – The file is split based on the number of records, because AWS MPU requires that all parts except the last one be at least 5 MB. (For example, if each record is roughly 500 bytes, splitting every ~11,000 records yields parts of roughly 5.5 MB.) Here I call two scripts, one before and one after message processing. The first script creates a file of the format <FILE_NAME>_00000000-000000-000.txt; this will be the first file picked up and will trigger the Initiate Multipart Upload. The second script creates the last file, of the format <FILE_NAME>_99991231-235959-000.txt, as well as Temp_Metadata.txt; the last file is the one that triggers the Complete Multipart Upload. The metadata file holds all the UploadIDs and ETags for logging, audit and monitoring purposes. I haven't provided the script code because it is simple and dependent on your OS.

    2. Receiver Channel



  2. Request Send Scenario

    1. ICO

    2. Sender Channel

    3. Receiver Channels

      1. Channel 1 – Request to Initiate Multipart Upload

      2. Channel 2 – Request to Upload Part

      3. Channel 3 – Request to Complete Multipart Upload


  3. Response Receive Scenario – Initiate Multipart Upload

    1. ICO

    2. Sender Channel

    3. Receiver Channel



  4. Response Receive Scenario – Upload Part

    1. ICO

    2. Sender Channel

    3. Receiver Channel



  5. Response Receive Scenario – Complete Multipart Upload

    1. ICO

    2. Sender Channel

    3. Receiver Channel




Unit Test

I have provided only the skeleton and the basic information needed to get this requirement working. Because there is routing to multiple interfaces in the request part, I found it difficult to implement EOIO quality of service in this scenario. Hence I had to introduce a mandatory delay of 600 seconds, calculated based on the time it took to upload one part of the file, which was 6-7 minutes.


Also, after the Complete Multipart Upload response is written, the metadata file is archived. This is a sample file to be uploaded.


A file of this size is split into multiple files of size slightly larger than 5 MB.



The metadata file contains the following data. This is written into the file by the 3 response mappings.


 

Conclusion

This way, we don't have to worry about the size of the files uploaded into AWS s3. But there are a few points to be considered here:

  1. SAP PO has no provision for greater-than or less-than comparisons in the condition editor. This prevented us from routing the file to either single chunk or multipart upload based on file size.

  2. AWS has no central repository for the XML structures its APIs use; it would have been easier to simply download and use them. Quite some time was spent resolving namespace issues.

  3. A Java mapping could be used to eliminate the need for separate mappings to resolve the namespace issues. EOIO quality of service from a file sender to multiple receivers is also possible with a Java mapping.

  4. The runtime of this scenario seems to grow exponentially with the number of parts the file is divided into. It could be optimized by processing files differently based on threshold values: for example, files of 5 MB or less can be uploaded in a single chunk, while larger files use multipart upload.

  5. The absence of EOIO necessitates that only one file be present in the source folder at a time. With multiple files, multiple metadata files are created and the process cannot reliably write the correct ETag values back to the correct metadata file.

  6. There are also clean-up APIs that should be implemented in case a file upload fails, since a restart begins with a new UploadID. AWS provides APIs to list all current multipart uploads and abort the incomplete ones, which keeps the bucket clean (see the sketch after this list).
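For the clean-up in point 6, incomplete uploads can be listed with ListMultipartUploads (GET /?uploads) and removed with AbortMultipartUpload. A minimal sketch of the abort call, with the bucket, key and header values as placeholders:

import java.net.HttpURLConnection;
import java.net.URL;

public class AbortMpuSketch {

    static int abort(String bucket, String key, String uploadId, String authHeader, String amzDate, String emptyBodyHash) throws Exception {
        // DELETE /<key>?uploadId=<id> aborts the upload and frees the already-uploaded parts
        URL url = new URL("https://" + bucket + ".s3.amazonaws.com/" + key + "?uploadId=" + uploadId);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("DELETE");
        con.setRequestProperty("Authorization", authHeader);
        con.setRequestProperty("x-amz-date", amzDate);
        con.setRequestProperty("x-amz-content-sha256", emptyBodyHash);
        return con.getResponseCode();  // 204 No Content on success
    }
}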


Feel free to share any improvements, suggestions or feedback, because this was built under a time crunch and there are surely many more ways to optimize it.

Lastly, I’ve listed the links/people I referred to during the development of this requirement.

  1. https://blogs.sap.com/2019/07/18/idoc-rest-async-sync-scenario-approach-with-aleaud/

  2. https://blogs.sap.com/2019/05/31/integrating-amazon-simple-storage-service-amazon-s3-and-sap-ecc-v6....

  3. https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html

  4. https://github.com/pnathan01/AWSSAPMultipartUploadLibrary

  5. https://wiki.scn.sap.com/wiki/display/XI/Where+to+get+the+libraries+for+XI+development

  6. https://launchpad.support.sap.com/#/notes/0001763011

  7. Robbie Cooray from AWS – He was my go-to person for valuable inputs and great ideas. Thanks Robbie!
