This is a three-part blog series covering the end-to-end integration scenario for SAP BW on HANA and Apache Hadoop as a Near-Line Storage (NLS) solution.
For Part 1: Apache Hadoop as NLS solution for SAP HANA Part 1
For Part 3: Coming Soon
After searching the internet for hours and days, trying to figure out the HANA and Hadoop integration process, I realized that there are a number of articles out there that talk about the Why's, the What's and the Who's of the process, but not many have delved into the “How” aspect of it. So here is my humble attempt.
To prepare Hadoop for use as an NLS solution, several key configuration steps need to be performed. These configuration changes can be made using a cluster management tool such as Ambari or Cloudera Manager, or by editing the XML files directly at the OS level. The advantage of using a cluster management tool is that once you change a parameter or value, the tool identifies the other impacted parameters and recommends appropriate values for them. These tools can also identify the services/roles that need to be restarted as a result of the change. Both tools also retain multiple versions of the XML configuration, so you can experiment and roll back if needed.
At a high level, you will need to perform the following steps to configure Apache Hadoop to integrate with BW on HANA:
- Create OS level user sap<sid> and group sap<sid>. Add the Hadoop users to the group sap<sid>
- Create HDFS Directory to store archived data with the right ownership and permissions for user sap<sid>
- Create a YARN queue in “capacity-scheduler.xml” that WebHCat can use to kick off Hive jobs, then assign the queue in Hive with the appropriate parameter for Templeton jobs.
- Create a Hive Schema/DB with the name sap<sid>
- Configure the YARN memory parameters. This is an important step, and you should study your distribution's documentation before configuring YARN. The ability of the Spark Controller and WebHCat to kick off Hive jobs depends on how you configure YARN, specifically the container CPU and memory parameters.
- Check the relevant SAP Notes to ensure your configuration is up to date and follows SAP recommendations.
Please follow the section “Configuration Steps > Apache Hadoop” from SAP Note 2363218 to configure the above key components in your Hadoop Cluster.
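To make the HDFS directory step from the list above a bit more concrete, here is a minimal Python sketch that builds the two WebHDFS REST calls involved: `MKDIRS` to create the directory and `SETOWNER` to hand it to sap<sid>. The base path `/sap/bw/nls`, the permission `770`, the admin user `hdfs`, the lowercase sap<sid> naming, and the `user.name` query parameter (which only applies to clusters without Kerberos) are all assumptions for illustration; the SAP Note remains the reference for the actual values.

```python
from urllib.parse import urlencode


def webhdfs_url(host, port, path, op, params):
    """Build a WebHDFS REST URL for an HDFS path and operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"


def archive_dir_requests(host, port, sid, base_dir="/sap/bw/nls", admin_user="hdfs"):
    """Return (HTTP method, URL) pairs that create the archive directory
    and assign it to sap<sid>. Nothing is sent; the caller decides how."""
    owner = f"sap{sid.lower()}"      # assumed naming: sap<sid> in lowercase
    path = f"{base_dir}/{owner}"     # hypothetical directory layout
    auth = {"user.name": admin_user}  # simple (non-Kerberos) authentication
    return [
        ("PUT", webhdfs_url(host, port, path, "MKDIRS",
                            {"permission": "770", **auth})),
        ("PUT", webhdfs_url(host, port, path, "SETOWNER",
                            {"owner": owner, "group": owner, **auth})),
    ]


if __name__ == "__main__":
    # 50070 is the usual NameNode HTTP port on Hadoop 2.x (9870 on 3.x).
    for method, url in archive_dir_requests("namenode.example.com", 50070, "BWP"):
        print(method, url)
```

The sketch only prints the requests, so you can review them before sending anything to the cluster with the HTTP client of your choice.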
I would like to share some tips I picked up while configuring these pieces.
HDFS and WebHDFS:
Pro tip: You can test WebHDFS by browsing the file system through either Ambari or Cloudera Manager, or as described here.
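If you prefer to test WebHDFS from a script, the following Python sketch calls the `LISTSTATUS` operation of the WebHDFS REST API and prints the entries of a directory. The host name, port, path and user below are placeholders, and the `user.name` parameter assumes a cluster without Kerberos.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def liststatus_url(host, port, path, user):
    """Build the WebHDFS LISTSTATUS URL for an HDFS path."""
    query = urlencode({"op": "LISTSTATUS", "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"


def parse_liststatus(response_text):
    """Return (name, type) pairs from a LISTSTATUS JSON response body."""
    statuses = json.loads(response_text)["FileStatuses"]["FileStatus"]
    return [(entry["pathSuffix"], entry["type"]) for entry in statuses]


def list_directory(host, port, path, user):
    """Call WebHDFS and list a directory (needs a reachable NameNode)."""
    with urlopen(liststatus_url(host, port, path, user), timeout=10) as resp:
        return parse_liststatus(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder values; replace with your NameNode host, port and user.
    for name, kind in list_directory("namenode.example.com", 50070, "/", "hdfs"):
        print(kind, name)
```

A listing that comes back without an HTTP error is a reasonable sign that WebHDFS is enabled and reachable from the machine running the script.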
Pro tip: When running, SAP Spark Controller launches up to 6 containers and can reserve a couple more, so make sure your configuration is in line with the available hardware resources. YARN does not check its configured values against the hardware that is actually available; keeping the two consistent is your responsibility.
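A quick back-of-the-envelope check can help here. The Python sketch below estimates how many containers of a given size fit into the cluster, based on the per-node values you would set in `yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores`, and compares that with the 6 running containers plus a reserve. The reserve of 2 is my reading of "a couple", and the container size is an assumed input; use the values from your own Spark Controller and YARN configuration.

```python
def containers_that_fit(node_memory_mb, node_vcores,
                        container_memory_mb, container_vcores, node_count):
    """Estimate how many equally sized containers the cluster can host.

    node_memory_mb and node_vcores correspond to the per-node settings
    yarn.nodemanager.resource.memory-mb and
    yarn.nodemanager.resource.cpu-vcores.
    """
    by_memory = node_memory_mb // container_memory_mb
    by_cpu = node_vcores // container_vcores
    return min(by_memory, by_cpu) * node_count


def controller_fits(node_memory_mb, node_vcores, container_memory_mb,
                    container_vcores, node_count, running=6, reserved=2):
    """Check whether the running plus reserved containers fit (assumed counts)."""
    capacity = containers_that_fit(node_memory_mb, node_vcores,
                                   container_memory_mb, container_vcores,
                                   node_count)
    return capacity >= running + reserved


if __name__ == "__main__":
    # Hypothetical cluster: 2 worker nodes, 16 GB and 8 vcores given to YARN
    # on each node, 4 GB / 1 vcore per container.
    print(containers_that_fit(16384, 8, 4096, 1, 2))
    print(controller_fits(16384, 8, 4096, 1, 2))
```

This deliberately ignores other workloads on the cluster and the scheduler's rounding of container sizes, so treat the result as an upper bound rather than a guarantee.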
Pro tip: Proxy user settings are key here. Make sure you configure the parameters hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups, since the user hanaes must be able to impersonate the sap<sid> user in order to read from and write to the HDFS directory through Hive, and to create tables in the sap<sid> database/schema in Hive.
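For readers editing the XML by hand, the Python sketch below renders the two proxy user properties in the `<property>` format that Hadoop's `core-site.xml` uses, so you can see what the entries look like. The host and group values are hypothetical examples, not recommendations; take the real values from the SAP Note and your security requirements.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(proxy_user, hosts, groups):
    """Render hadoop.proxyuser.<user>.hosts/groups as a <configuration> snippet."""
    settings = {
        f"hadoop.proxyuser.{proxy_user}.hosts": hosts,
        f"hadoop.proxyuser.{proxy_user}.groups": groups,
    }
    config = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(config, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical values: the Spark Controller host and the sap<sid> group.
    print(proxyuser_properties("hive", "sparkctl.example.com", "sapbwp"))
```

If you manage the cluster with Ambari or Cloudera Manager, you would enter the same two names and values in the tool instead of pasting XML.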
Pro tip: WebHCat and HCatalog are installed with Hive, starting with Hive release 0.11.0. No additional steps are needed. You can test WebHCat using the HTTP connection you create in SAP, or with the URL below:
http://<webhcat host fqdn>:50111/templeton/v1/status
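The same status check can be scripted. The Python sketch below requests the status URL shown above and checks that the JSON response reports `"status": "ok"`, which is what a healthy WebHCat server returns. The host name is a placeholder, and 50111 is the default WebHCat port; adjust both for your cluster.

```python
import json
from urllib.request import urlopen


def webhcat_status_url(host, port=50111):
    """Build the WebHCat (Templeton) status URL."""
    return f"http://{host}:{port}/templeton/v1/status"


def is_webhcat_ok(response_text):
    """Return True if a status response body reports that the server is ok."""
    return json.loads(response_text).get("status") == "ok"


def check_webhcat(host, port=50111):
    """Call the status endpoint (needs a reachable WebHCat server)."""
    with urlopen(webhcat_status_url(host, port), timeout=10) as resp:
        return is_webhcat_ok(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder host name; replace with your WebHCat host FQDN.
    print(check_webhcat("webhcat.example.com"))
```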
SAP Spark Controller:
SAP Spark Controller is the adapter framework that connects SAP HANA to the Hive Database and also allows SAP to use Spark to get to the HDFS persistent store.
Please refer to the SAP Help section on installing and configuring SAP Spark Controller here. You can install the controller manually or using one of the cluster management tools. There are a few bugs related to stopping and starting the controller from the cluster management tools, but for the most part the installation, configuration and management are fairly straightforward. Pay close attention to the “Prerequisites”, since the Spark assembly JAR file and the proxy user settings are key.
In order to successfully integrate Hadoop as an NLS solution for SAP BW, you will need to ensure that these key components are configured correctly and have the necessary properties defined.
Creating the Remote Source in HANA:
You can create a remote source in SAP HANA, using the SAP HANA Studio or an SQL Console to connect SAP HANA to the SAP Spark Controller.
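As a rough illustration of the SQL Console route, the Python sketch below assembles a `CREATE REMOTE SOURCE` statement that you could paste into the console. The adapter name `sparksql`, the port 7860, the `ssl_mode=disabled` setting and the configuration keys are assumptions based on commonly documented Spark Controller defaults, and the source name, host and password are placeholders; please verify all of them against the SAP Help page for your Spark Controller version.

```python
def create_remote_source_sql(name, host, port=7860,
                             user="hanaes", password="<password>"):
    """Assemble a CREATE REMOTE SOURCE statement for the Spark Controller.

    The adapter name and configuration keys are assumptions; the values are
    inserted as-is, so do not pass untrusted input.
    """
    config = f"port={port};ssl_mode=disabled;server={host}"
    credentials = f"user={user};password={password}"
    return (
        f'CREATE REMOTE SOURCE "{name}" ADAPTER "sparksql" '
        f"CONFIGURATION '{config}' "
        f"WITH CREDENTIAL TYPE 'PASSWORD' USING '{credentials}'"
    )


if __name__ == "__main__":
    # Hypothetical remote source name and controller host.
    print(create_remote_source_sql("SPARK_NLS", "sparkctl.example.com"))
```

The SAP HANA Studio wizard asks for the same pieces of information (adapter, server, port and credentials), so the statement also doubles as a checklist of what to have at hand.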
This concludes Part two. In the next blog, I will elaborate on the BW side configuration along with creating the Data Archiving Process for an ADSO in BW.
Coming up: Apache Hadoop as NLS solution for SAP HANA Part 3