Your SAP on Azure – Part 23 – Archive SAP data to Azure Data Lake
UPDATE! As Microsoft announced that ADLS Gen 1 will retire in 2024, I can’t recommend this solution any longer. ADLS Gen 2 currently doesn’t support the WebHDFS protocol. If you are interested in data archiving, I recommend checking the ILM extension that allows connecting directly to the blob store:
You’ve probably noticed that I like to play around with SAP and Azure. There are so many possibilities to integrate both solutions and I truly believe that Microsoft’s cloud platform is just the best way to host your SAP landscape. In a few previous posts, I focused on the Azure Data Platform and how to unleash information stored in the system using services like Data Lake, HDInsight and Databricks. Just recently I thought that the cheap cloud storage surrounded by advanced analytics tools could also be a perfect place for the archived data.
The topic of data management is especially important for customers who have decided to migrate to SAP HANA. RAM is significantly more expensive than persistent storage, and I think every customer who already runs the SAP HANA database or plans to implement it should review their data archiving strategy. Offloading old and rarely accessed data to external storage reduces the cost of running SAP HANA and will improve the performance of any other database. It should be on your priority list if you consider moving to SAP S/4HANA in the near future. The question is not if you should implement data archiving but how to do it best!
There are many ways to integrate Azure into the data archiving process. A basic integration could rely on synchronizing the files between local persistent storage and the cloud. But it’s not a perfect solution, as you’d have to maintain two archive stores (local and cloud) and ensure there are no replication problems. I was looking for a more elegant solution.
The SAP ILM Store, which is part of SAP ERP and SAP S/4HANA, has the functionality I was looking for. It’s tightly integrated with SAP and allows you to offload data to external storage. Customers use it in conjunction with the SAP IQ database, but I don’t think many people know that starting from SAP NetWeaver 7.40 SPS13 it is also possible to use Hadoop as the target, through a WebHDFS interface which is supported by Azure Data Lake.
There is just one part missing. The archived data must be readable for external, non-SAP applications. To fulfill this requirement, I’m using the Data Archiving Service that is part of AS Java and allows us to save files in XML format.
In my landscape, I’m using the same system as the data source and ILM Store, but for larger implementations, you can connect multiple systems.
This is quite a long guide, so let’s get started!
Usually, I try to describe every single activity required to make the solution work. This time there are a few dependencies that I decided to exclude from the post to make it more readable. Before starting the configuration please ensure you have:
- SAP ERP or S/4HANA system available – required to configure the ILM Store functionality. In my landscape, the same system will be the source of data to be archived. All steps required to activate and configure the ILM Store functionality and integration with the Azure Data Lake are covered in this blog.
- ABAP SDK for Azure – we’re going to use the ABAP SDK to authenticate against Azure AD. It is not a mandatory part, but I highly encourage you to install it. If you know ABAP you could write your own authentication implementation, but why bother if someone has already done it for us?
- SAP AS Java system – transforms data coming from the ERP system into XML format
- Azure Data Lake Storage Gen1 – a storage solution for the archived data
You can find many guides on how to install the above components, so I don’t think it makes sense to repeat already available content. Once all resources are prepared we can start the configuration.
CREATE SERVICE PRINCIPAL
The very first activity we need to perform is to create a Service Principal in Microsoft Azure. We will use it as a “service user” to operate on the Azure Data Lake service, for example, to upload or download files. Go to the Azure Active Directory blade and select App Registrations. Click on + New registration to create a new Service Principal:
Enter the name of the Service Principal and leave the default settings. Save your entries:
To authenticate against Azure AD we’ll require Client ID and Client Secret as well as the OAuth endpoint. You can find the first value on the main screen of the Service Principal:
To create the key select the Certificates and Secrets entry on the menu. Create a new secret and copy the assigned value. When you close the blade you won’t be able to retrieve it again.
To display the OAuth endpoint address, click on the Endpoints button on the Overview page. We’ll need it later together with the Client ID and Secret during the ABAP SDK configuration.
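Under the hood, these three values drive the OAuth 2.0 client-credentials flow that the ABAP SDK will perform for us later. As a quick sketch of what that request looks like (Python, purely illustrative – the tenant ID, Client ID and Secret below are placeholders):

```python
from urllib.parse import urlencode

# Placeholder values - substitute your own tenant ID, Client ID and Client Secret
tenant_id = "00000000-0000-0000-0000-000000000000"
token_endpoint = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"

# Client-credentials grant: the Service Principal authenticates with its secret
# and requests a token for the Azure Data Lake resource
form_body = urlencode({
    "grant_type": "client_credentials",
    "client_id": "<your-client-id>",
    "client_secret": "<your-client-secret>",
    "resource": "https://datalake.azure.net/",
})

print(token_endpoint)
print(form_body)
```

POSTing that form body to the token endpoint returns a JSON document containing the access token, which is then sent as a Bearer header with every Data Lake call.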
We need to assign Azure Data Lake permissions to the Service Principal to ensure it can access the service. Select API Permissions from the menu and then add permission for Azure Data Lake:
To finish the configuration of the Service Principal, assign file permissions on the storage level. I created a separate directory SAPArchive where I’d like to keep all archived data. Assign Read/Write/Execute permissions to the selected directory. In addition, set the Execute access to the root directory of the Azure Data Lake.
That was the last step required to configure a new Service Principal. Now let’s focus on the ILM Store.
PUBLISH ILM STORE
Before we enable the ILM Store functionality in SAP, I’d like to briefly explain why we need it. When you use the XML Data Archiving Service available in the AS Java, you have two options to save the files. You can either store them locally in a selected directory or publish them through the WebDAV protocol. You probably know the basic HTTP methods like POST, GET, PUT and DELETE. They correspond to the CRUD operations known from databases and let you create, read, update and delete information on a remote source. The WebDAV protocol extends the list of available request methods with advanced content authoring operations, allowing us to work directly on documents stored on a remote server. It’s quite popular, and most operating systems have built-in client functionality to access WebDAV servers (including Windows Explorer).
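To make the distinction concrete, here is a small reference sketch (Python, for illustration only) of how the core HTTP methods map to CRUD and which extra methods WebDAV (RFC 4918) adds on top:

```python
# Core HTTP methods map roughly to CRUD operations
crud_to_http = {
    "create": "POST",    # or PUT
    "read":   "GET",
    "update": "PUT",
    "delete": "DELETE",
}

# WebDAV (RFC 4918) extends HTTP with content authoring methods, e.g.:
webdav_methods = {
    "MKCOL":     "create a collection (directory) on the server",
    "PROPFIND":  "read properties of a resource or list a collection",
    "PROPPATCH": "change properties of a resource",
    "COPY":      "copy a resource on the server",
    "MOVE":      "move or rename a resource",
    "LOCK":      "lock a resource to coordinate concurrent authoring",
}

for method, purpose in webdav_methods.items():
    print(f"{method:9} - {purpose}")
```

The MKCOL method in particular will show up again later, when we look at how the ILM Store creates the directory structure in the archive.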
The ILM Store is SAP’s implementation of the WebDAV protocol, and it acts as an interface between the system and remote storage such as the IQ database or a Hadoop filesystem. There are many other 3rd party WebDAV servers available, but only a small subset is certified to work correctly with SAP systems. Initially, I thought there would be a native Azure functionality that I could use instead of the ILM Store, but that was just wishful thinking. I could probably create a virtual machine with Microsoft IIS Server and expose the WebDAV interface this way, but such a solution wouldn’t be certified and it still wouldn’t allow me to save files directly to Azure Data Lake. As the ILM Store has a dedicated connector to the Hadoop filesystem over the WebHDFS protocol, and I know a few customers who actively use the ILM Store, I decided to focus on the native SAP solution.
You can enable the ILM functionality in the ERP or S4HANA system by activating ILM and ILM_STORE Business Functions in transaction SFW5:
To access the ILM Store and save files through the WebDAV interface we need a service user with proper authorizations. Create a new role ZILM_ADM with the following authorization objects:
- Authorization object: SILMSTOR
Activity: 16 (Execute)
- Authorization object: SILMSTORAD
Activity: 02 (Change), 07 (Activate, generate), 39 (Check)
- Authorization object: S_DATASET
File name: *
Program: CL_ILM_STOR_DATASET_CP and RILM_STOR_PUT_WORKER
Activity: 06 (Delete), 33 (Read), 34 (Write)
- Authorization object: S_DEVELOP
Activity: 07 (Activate), 40 (Create in DB)
- Authorization object: S_CTS_ADMI
- Authorization object: S_CTS_SADM
Create a new service user ILM_SVC and assign the created role.
When the user is ready we can create an endpoint to access the ILM Store. Go to SICF, expand path /default_host/sap/bc/ilm and create a new service. On the Handler List tab provide the handler class CL_ILM_STOR_WD_REQUEST_HANDLER. Next, switch to the Logon Data tab and enter credentials of the service user.
The ILM Store is now available!
Almost the entire configuration of SAP ILM depends on the Origin, which is also referred to as the Data Source. It can be an individual system, a specific client, or an archiving object. Based on the Origin you can create different routings and set processing rules depending on the source of the data. It is especially important in large landscapes with a central ILM system. For testing purposes, SAP delivers a preconfigured origin archeb, which I’m going to use in this guide. It contains reference customizing that we just need to adjust to make it work with Hadoop.
The table TILM_STOR_O_ROUT lets us link the source of the data with the origin. Open transaction SM30 and create a new entry to reference your System ID with the Data Source archeb. Leave the other fields empty.
The initial configuration of the ILM Store is now completed and we can move to setting up the Hadoop connection.
CONFIGURE OAUTH AUTHENTICATION WITH AZURE
SAP ILM can communicate with the Hadoop filesystem through the WebHDFS protocol, which is available for Azure Data Lake Gen 1. We can use it as the target storage for the archived data, but we need to create an additional authentication method. By default, the ILM Store supports basic authentication, which due to security concerns is not available in Azure. We can use OAuth instead.
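To give you an idea of what these WebHDFS calls look like on the wire, here is an illustrative sketch in Python – the account name and path are placeholders, and the exact query parameters may differ in your landscape:

```python
# Hypothetical account and path - adjust to your own Data Lake store
account = "mydatalake"            # <account>.azuredatalakestore.net
path = "SAPArchive/demo/file.bin"

# WebHDFS REST API: op=CREATE writes a file, op=OPEN reads it back
base = f"https://{account}.azuredatalakestore.net/webhdfs/v1/{path}"
create_url = f"{base}?op=CREATE&write=true"
read_url = f"{base}?op=OPEN"

# With OAuth, every call carries a bearer token instead of basic credentials
headers = {"Authorization": "Bearer <token-from-azure-ad>"}

print(create_url)
print(read_url)
```

The ILM Hadoop connector builds requests of this shape for us; our only job is to inject the Authorization header, which is exactly what the BAdI implementation below does.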
SAP allows implementing an additional authentication method using an Enhancement Spot. Instead of writing the entire logic from scratch, I decided to reuse the ABAP SDK for Azure, which comes with prebuilt classes that support OAuth. To install the ABAP SDK, follow the document available here.
Once the ABAP SDK for Azure is available, go to transaction SM59 and create a new connection of type G using the Azure AD token endpoint address collected when creating the Service Principal:
In the Logon & Security tab activate SSL and select the certificate:
If you try to test the connection now, it will fail, as the certificate of the remote host is not known to your SAP system. To solve this issue we need to download the certificates from the endpoint and import them in the STRUST transaction. Open the web browser and enter the Azure AD endpoint address configured in the RFC Destination. Click on the lock icon next to the address bar to display the certificate:
Export all three certificates to file using Copy to File option available on the details tab. Once they are exported go to transaction STRUST and open the PSE selected during connection setup. Import all three certificates to the certificate list.
Now, the connection test in SM59 is successful.
ABAP SDK for Azure requires a bit of configuration. We start by maintaining the table ZREST_CONFIG. Define a new communication interface to Azure AD and link it with the previously created RFC destination.
In table ZREST_CONF_MISC define additional interface settings. Set the HTTP Method to POST and enter the maximum number of connection retries in case of a connectivity error in the field Counter. You can also provide an e-mail address to receive notifications in case of communication failure.
In table ZADF_CONFIG enter the Client Secret generated during the Service Principal creation, set the interface type to Azure Active Directory and set Synchronous as the Call type. Ensure you use the Interface ID defined in the earlier step.
The ABAP SDK for Azure comes with a report that lets us verify the connectivity. We just need to slightly adjust it to cover our setup and requirements.
Edit the ZADF_DEMO_AZURE_AAD report and implement three changes:
- Change the interface name in line 13
- Enter the service principal Client ID in line 33
- Set the resource to https://datalake.azure.net/ in line 34
Save your changes and run the report. As the output you should see an authorization token.
By default, ILM implements two authentication methods.
- NO_AUTHENTICATION – the name is a bit misleading, as you can use this method when the Hadoop cluster supports basic authentication or when it doesn’t require authentication at all.
- PSEUDO_AUTHENTICATION – in addition to NO_AUTHENTICATION, this method provides an additional parameter that is sent together with every request.
To enable OAuth authentication, we need to create a BAdI Implementation that will be automatically called whenever the ILM Store accesses Azure Data Lake.
Register a new authentication method ZOAUTH_AZURE in the table TILM_STOR_HDP_A:
Create the BAdI Implementation using transaction SE19. Enter ILM_STOR_HDP_AUTHENTICATE in the Enhancement Implementation field and click Change. Fill the popup window with the following values:
BAdI Definition: ILM_STOR_HDP_AUTHENTICATE
BAdI Implementation: ZHDP_OAUTH_AZURE
Implementing Class: ZCL_ILM_STOR_HDP_OAUTH_AZURE
Short text: ILM Store: OAuth with Microsoft Azure
The filter criteria are used to determine which BADI Implementation should be called based on the customizing we’ll provide later. Enter ZOAUTH_AZURE in the Value field and AUTHENTICATION_METHOD as the Filter.
Based on the report ZADF_DEMO_AZURE_AAD which we used earlier to test the connection I was able to create the method to retrieve the authentication token.
Please remember I’m not a programmer! I did my best to make it work, but I’m also aware there is room for improvement. With the code below, every write operation is preceded by requesting a token, which is not optimal (and affects the solution’s performance). If you plan to use this code, please spend some time optimizing it.
Adjust lines 9, 28 and 29 with your settings:
*&---------------------------------------------------------------------*
*& Below implementation is a copy of the ZADF_DEMO_AZURE_AAD report
*& to get an authentication token from Azure AD
*& Thanks to the ABAP SDK for Azure for creating this template
*&---------------------------------------------------------------------*
METHOD if_ilm_stor_hdp_authenticate~authenticate.
  DATA: gv_interface_aad TYPE zinterface_id VALUE 'ADL',
        gv_message_bid   TYPE zbusinessid VALUE 'TEST_AAD',
        gv_string        TYPE string,
        oref             TYPE REF TO zcl_adf_service,
        oref_aad         TYPE REF TO zcl_adf_service_aad,
        cx_adf_service   TYPE REF TO zcx_adf_service,
        cx_interface     TYPE REF TO zcx_interace_config_missing,
        cx_http          TYPE REF TO zcx_http_client_failed,
        gv_aad_token     TYPE string,
        gv_response      TYPE string.

* Calling factory method to instantiate the AAD client
  oref = zcl_adf_service_factory=>create(
           iv_interface_id        = gv_interface_aad
           iv_business_identifier = gv_message_bid ).
  oref_aad ?= oref.

  TRY.
      CALL METHOD oref_aad->get_aad_token
        EXPORTING
          iv_client_id = '556e443e-7406-4f30-a345-9742da13a948' " Client ID of the Service Principal
          iv_resource  = 'https://datalake.azure.net/'          " Resource identifier of Azure Data Lake
        IMPORTING
          ev_aad_token = gv_aad_token
          ev_response  = gv_response.
    CATCH zcx_interace_config_missing INTO cx_interface.
      gv_string = cx_interface->get_text( ).
      MESSAGE gv_string TYPE 'E'.
    CATCH zcx_http_client_failed INTO cx_http.
      gv_string = cx_http->get_text( ).
      MESSAGE gv_string TYPE 'E'.
    CATCH zcx_adf_service INTO cx_adf_service.
      gv_string = cx_adf_service->get_text( ).
      MESSAGE gv_string TYPE 'E'.
  ENDTRY.

  IF NOT gv_aad_token IS INITIAL.
*   Prefix the token and inject it as the Authorization header of the request
    CONCATENATE 'Bearer' space gv_aad_token INTO gv_aad_token RESPECTING BLANKS.
    io_connector->get_request( )->set_header_field(
      name  = 'Authorization'
      value = gv_aad_token ).
  ENDIF.
ENDMETHOD.
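If you want to tackle the token-per-request overhead I mentioned, one option is to cache the token and only refresh it shortly before it expires. Here is a minimal sketch of the idea in Python (the same pattern could be implemented in ABAP, for example with a static class attribute); the one-hour lifetime is an assumption:

```python
import time

class TokenCache:
    """Cache an access token and request a new one only near expiry."""

    def __init__(self, fetch_token, refresh_margin=300):
        self._fetch = fetch_token          # callable returning (token, lifetime_seconds)
        self._margin = refresh_margin      # refresh this many seconds before expiry
        self._token = None
        self._expires_at = 0.0

    def get(self):
        if self._token is None or time.time() >= self._expires_at - self._margin:
            token, lifetime = self._fetch()
            self._token = token
            self._expires_at = time.time() + lifetime
        return self._token

# Dummy fetcher standing in for the real Azure AD call
calls = []
def fetch():
    calls.append(1)
    return f"token-{len(calls)}", 3600   # assumed one-hour token lifetime

cache = TokenCache(fetch)
assert cache.get() == cache.get()        # second call is served from the cache
print("token requests made:", len(calls))
```

With this pattern the authentication endpoint is only contacted once per token lifetime, instead of once per write operation.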
Activate the changes. The new authentication method is now available. We can jump to the AS Java instance and enable XML Data Archiving Service.
CONFIGURE XML DATA ARCHIVING SERVICE
The XML Data Archiving service, which is part of the AS Java, is required to transform archive data to XML documents which are then stored in Azure Data Lake. Storing the files in open format has a lot of advantages – the biggest one is that you can use them with Azure Data Platform services.
There is a wizard available to perform the initial configuration of the XML Data Archiving Service. Log in to the SAP NetWeaver Administrator and open the configuration of functional units. Select XML Data Archiving Service and click Enable Automatically.
The wizard guides you through the configuration. Once the process is completed you will see a summary. Note down the service user names and password, as we’ll need them later to configure the RFC connection from the ABAP system. In case of issues, please review the logs and correct your settings.
Check the XML Data Archiving Service is running. Open NetWeaver Administration and go to Operations -> Data and Databases -> XML DAS Administration.
We need to cross-reference the systems by creating the respective destinations. Let’s start with the AS Java. Create a new destination, point it to the ILM service we created earlier and provide the ILM Store service user credentials. You can skip testing the connection, as it will result in an error anyway. We will verify it later.
Go back to the ABAP system and create the RFC connection pointing to XML Data Archiving Service. Enter your AS Java hostname and port. Provide /DataArchivingService/DAS in the Path field.
Move to the Logon & Security tab. Enter the communication user data that was created during the automatic configuration of the Data Archiving Service.
Save your settings and test the connection. When you switch to the Response Body you should see the main XML DAS page.
Enter the created connection name in the configuration of the Data Archiving Service in the ABAP instance. Open SPRO and go to SAP NetWeaver -> Application Server -> System Administration -> Data Archiving -> XML-based Archiving -> Configuration of XML DAS. Enter the RFC name and check the Active field.
Save your settings. In the same SPRO node run the Synchronization of Home-Paths for XML-based Archiving to create Archive Hierarchy in the AS Java. Start with running the report in the test mode and if no errors are reported, repeat the operation in the production mode.
Go back to the XML Data Archiving Service to verify the hierarchy is actually created.
Go back to the XML Data Archiving Service to register the Archive Store in the AS Java. Open Archive Store Management tab and click on the Define button. Type the desired Archive Store name. Set WebDAV as the Store Type and select the destination pointing to the ILM Store we created previously. Save your settings and verify the connection is working by clicking Test.
The last activity that you need to perform in the AS Java is to assign the Archive Store to the Archiving Object. Go to the Archive Hierarchy tab and select the archiving object you want to configure. For the purpose of this blog, I’m using the BC_SBOOK_X object. Choose the previously created Archive Store and click Assign.
We finished the configuration of the XML Data Archiving Service on the AS Java. The next section will guide you through the outstanding ILM Store Hadoop configuration.
CONFIGURE ILM HADOOP CONNECTOR
We’re almost done!
In this section, we focus on the ILM Store Hadoop configuration. As the first activity, create a new RFC Destination pointing to your Azure Data Lake store. The hostname is your Data Lake URL, the port should be set to 443, and the path should point to the WebHDFS interface /webhdfs/v1/. Activate SSL on the Logon & Security tab and leave the other settings as they are.
Verify the connection is working. In case you receive an SSL handshake error, you need to add the endpoint certificate the same way we did for the OAuth token endpoint.
In the section Publish ILM Store, we started the configuration of the SAP ILM framework. By filling the routing table, we defined a link between the source system and the origin. Now we’ll set the processing rules for the archeb origin and instruct the system to save all data coming from it to the Data Lake. I change only the customizing related to the Hadoop connector, leaving all other values as they were.
First, we need to define the connection that should be used for the origin. The entry should already exist in the system, but with an incorrect value. Open the maintenance of table TILM_STOR_CUS and adjust the record:
Data source: archeb
Value (Connection ID): ADL_ARCHIVE
In the same maintenance view create a new entry to point the Connection ID to the Connection Type HADOOP. It tells the system which interface to use. In the next step we’ll link the Connection Type with the implementation class that will process archive data. Replace ADL_ARCHIVE in the property field if you chose a different connection name in the last step:
Data source: archeb
Value (Connection Type): HADOOP
In the customizing view TILMSTOR_CF reference the HADOOP connection type to the implementation class. Create a new entry with following values:
API Version: Default
Parameter: HADOOP (adjust it if you chose a different Connection Type)
Implementing class: CL_ILM_STOR_DB_OPERATOR_HDP
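Putting the last few steps together, the customizing forms a simple lookup chain from the source system all the way to the implementing class. A simplified sketch of that resolution (Python dictionaries standing in for the real tables – not their actual structures; the system ID A4H matches my demo landscape):

```python
# Illustrative only - simplified stand-ins for the real customizing tables
tilm_stor_o_rout = {"A4H": "archeb"}                     # system -> origin
tilm_stor_cus = {"archeb": "ADL_ARCHIVE"}                # origin -> connection ID
connection_type = {"ADL_ARCHIVE": "HADOOP"}              # connection ID -> type
tilmstor_cf = {"HADOOP": "CL_ILM_STOR_DB_OPERATOR_HDP"}  # type -> implementing class

def resolve_store_class(system_id):
    origin = tilm_stor_o_rout[system_id]
    conn_id = tilm_stor_cus[origin]
    conn_type = connection_type[conn_id]
    return tilmstor_cf[conn_type]

print(resolve_store_class("A4H"))  # CL_ILM_STOR_DB_OPERATOR_HDP
```

Keeping this chain in mind makes it much easier to troubleshoot a misconfigured entry: every hop is just one table lookup.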
Now we need to provide general settings that don’t depend on the origin. Open maintenance view TILM_STOR_HDP_C to link the Authentication Method created earlier with the Connection ID. I couldn’t find the correct value for the Max Data Size in Azure Data Lake, so I’m using the one that was provided as an example in the ILM Hadoop Connector Configuration Guide. 2.147.483.648 bytes equals 2 GiB, so it seems about right.
Connection ID: ADL_ARCHIVE
Max Data Size: 2.147.483.648
Authentication Method: ZOAUTH_AZURE (adjust it if you provided a different name in view TILM_STOR_HDP_A)
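A quick sanity check of the Max Data Size example value:

```python
max_data_size = 2_147_483_648            # example value from the configuration guide
assert max_data_size == 2 ** 31          # the signed 32-bit boundary
print(max_data_size / 1024 ** 3, "GiB")  # exactly 2 GiB
```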
In the origin-dependent customizing view TILM_STOR_HDP_CO we need to link the origin with the configured Connection ID and RFC Destination:
Connection ID: ADL_ARCHIVE
Data Source: archeb
RFC Destination: ZILM_STORE_ADL
In the last customizing table, we just need to adjust the path where the data will be stored. When I deployed the Azure Data Lake service, I created a new directory where I would like to keep my files. The default value is /tmp/, which doesn’t sound right for a data archive. You can change the path in transaction FILE – adjust the Physical Path to the selected directory name.
Logical Path: ILM_STOR_ADK_ROOT_FOLDER
Syntax Group: UNIX
Physical Path: SAPArchive/<FILENAME>
That’s everything! The entire configuration is complete and we can start testing!
We executed all required configuration steps and the ILM integration with Hadoop is now complete. We can offload outdated information from the SAP system to the remote Azure Data Lake service. It’s tempting to start the archiving process right away, but first let’s run three validation reports.
To validate the overall SAP ILM settings, I use the RILM_STOR_TEST_HC_SERVER report. It executes multiple checks and highlights incorrect configuration. However, it’s not the best way to verify the connection to the Azure Data Lake store. One of the WebHDFS operations that the report uses is not supported by the Data Lake, so you will get an error saying it can’t connect using the Hadoop connector. You can ignore this error – my tests didn’t indicate the missing operation has any influence on the actual archiving process. I assume it was implemented just for testing purposes. You can also skip errors related to operating system commands – the ILM Hadoop connector uses HTTP calls, so you don’t have to implement them.
To validate the Hadoop connection there is another report RILM_STOR_TEST_HADOOP. It performs a set of operations to ensure the files can be stored in the remote storage. Enter your Connection ID and Type on the selection screen.
The report shouldn’t highlight any issue.
Finally, we can also check the performance of the solution. SAP prepared an additional report for this purpose. But before we can run it we need to prepare a new destination in SM59 pointing to the ILM Store endpoint (I promise, it’s the last one!).
Execute the report RILM_STOR_TEST_PF_REPEAT_SMALL:
The report creates the directory structure and generates a set amount of documents that are pushed to the ILM Store and Azure Data Lake. I found it difficult to compare the results, as there are no publicly available KPIs that should be met. Therefore I think the best way is to use the example provided in the ILM Store configuration guide:
SAP ILM Configuration Guide results:
If during subsequent runs the report displays “MKCOL – 405 Method Not Allowed” errors, don’t worry. It’s a harmless bug – the report doesn’t delete the entire collection structure, so when you execute it again some directories already exist, which results in the 405-type errors.
As always I highly encourage you to perform the Proof of Concept to ensure the solution is working properly in your landscape.
This is the moment we’ve been waiting for. After a long configuration we can finally start the archiving process to see how the solution behaves in real life. Using the report SAPBC_DATA_GENERATOR I generated test SFLIGHT data. I went with the minimum number of records, as it should give a reasonable amount of data to archive.
The archiving object I use is BC_SBOOK_X. It’s pre-delivered by SAP and we can use it to verify the Hadoop integration. Enter transaction SARA and select the archiving object.
Select Write to start the archiving run and transfer data to the Data Lake. I created a new variant to include all flights assigned to the airline Alitalia (AZ).
When you start the process, the SAP system triggers a data archiving job. Based on the provided criteria, the system found 1878 objects to archive.
It takes around 10 minutes to complete. In the meantime, you can look at the Azure Data Lake to see files are actually being created!
When the archive job is completed you can remove the records from the database. If you need to access the data later, you can use the read program:
The archiving process is completed, and I could stop my post here. But there is one more thing I’d like to show you. In the introduction I wrote that storing data in the Azure Data Lake gives you access to the entire data platform. For example, you can use the Azure Databricks to easily read the files:
val df = spark.read.format("xml")
  .option("rowTag", "SBOOK")
  .load("/mnt/SAPArchive/a4h/100/bc_sbook_x/000002/*.xml")
df.show()
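If you’d rather peek into a single archived file without spinning up a Spark cluster, plain Python can parse the XML as well. The element names below are assumptions modeled on the SBOOK example – check an actual archive file for the real structure:

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for an archived document - real tag names may differ
sample = """<values>
  <SBOOK><CARRID>AZ</CARRID><CONNID>0555</CONNID><BOOKID>00000001</BOOKID></SBOOK>
  <SBOOK><CARRID>AZ</CARRID><CONNID>0555</CONNID><BOOKID>00000002</BOOKID></SBOOK>
</values>"""

root = ET.fromstring(sample)
# Turn each SBOOK row element into a dictionary of field name -> value
bookings = [
    {child.tag: child.text for child in row}
    for row in root.iter("SBOOK")
]
print(bookings)
```

The same approach scales down nicely for spot checks: download one file from the Data Lake, parse it, and confirm the archived records look as expected.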
This way, the archived data remains a relevant source for reporting. Even after offloading, it doesn’t lose its value – moreover, you can now use it together with data coming from other sources. But that’s just an example. I encourage you to experiment with other Azure services and use the archived data. Who knows, maybe you’ll discover things that were previously hidden?