Apache Hadoop as NLS solution for SAP BW/HANA Part 1
This is a three-part blog series covering the end-to-end integration scenario for SAP BW on HANA and Apache Hadoop as a Near-Line Storage (NLS) solution.
For Part 2: Here
For Part 3: Coming Soon
Introduction:
Apache Hadoop has become the poster child for big data, largely because it is a highly scalable analytics platform capable of processing large volumes of structured and unstructured data. SAP HANA, on the other hand, has gained ground as the leading in-memory data analytics platform, letting you accelerate business processes and deliver quantifiable business intelligence at lightning speed. The two platforms are independent of each other, and their complementary strengths make them a perfect fit for a long-term, sustainable, high-performance data lake strategy for any large multinational corporation.
This blog is intended as a walkthrough for implementing Apache Hadoop as a Near-Line Storage solution for SAP HANA, leveraging the SAP HANA Spark Controller. For this blog, we will work with the following versions and software products:
SAP BW 7.5 SPS 5
SAP HANA 1.0 SPS 12 and higher
Core Apache Hadoop version 2.7.1 or higher (HDFS, MapReduce2, YARN)
Tez 0.7.0 as the execution engine for Hive (instead of MapReduce2, if preferred)
Spark 1.5.2 or higher
SAP HANA Spark Controller 2.0 SP01 Patch 1 or higher
SAP recommends these as the baseline requirements, but experience has taught me that these versions work very well with each other in terms of dependencies and interoperability. Both Hortonworks (HDP) and Cloudera (CDH) offer packaged Apache platforms that include the above versions. I have not personally worked with MapR, so I cannot confirm whether they do as well, but I expect they offer a comparable combination that works well with SAP.
Hadoop Cluster Architecture and Sizing:
If you are considering a POC, my recommendation would be to start with at least a 3-node Hadoop cluster with 1 NameNode and 2 DataNodes. This gives the administration team a good feel for a production cluster in terms of setup and administrative duties. Apache Hadoop is a multi-component solution, and as such its tuning and configuration aspects are fairly diverse yet interdependent. The overall architecture is depicted below at a high level.
Hadoop 3 Node Cluster:
SAP BW on HANA, Spark Controller and Hadoop:
Near Line Storage:
Please refer to the vendor documentation for Hadoop sizing. For reference, below is the cluster sizing that I used for the proof of concept:
We went with a virtualized cluster for the POC.
Hadoop Installation:
Depending on the flavor of Hadoop chosen for the POC, you can install the Hadoop cluster through Apache Ambari or through Cloudera Manager. Links to the detailed step-by-step installation guides are below:
Apache Ambari
https://ambari.apache.org/1.2.1/installing-hadoop-using-ambari/content/ambari-chap1.html
Cloudera Manager
http://www.cloudera.com/documentation/cdh/5-1-x/CDH5-Installation-Guide/CDH5-Installation-Guide.html
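Beyond the interactive wizards above, Ambari also supports scripted cluster provisioning via its Blueprints REST API, which fits the 3-node POC layout described earlier. The sketch below is illustrative only: the blueprint name, stack version, and component placement are assumptions for a minimal HDFS/YARN/Hive layout, and a real blueprint typically needs additional components (e.g. ZooKeeper) and configuration sections.

```json
{
  "Blueprints": {
    "blueprint_name": "nls-poc",
    "stack_name": "HDP",
    "stack_version": "2.4"
  },
  "host_groups": [
    {
      "name": "master",
      "cardinality": "1",
      "components": [
        { "name": "NAMENODE" },
        { "name": "RESOURCEMANAGER" },
        { "name": "HIVE_METASTORE" },
        { "name": "HIVE_SERVER" }
      ]
    },
    {
      "name": "worker",
      "cardinality": "2",
      "components": [
        { "name": "DATANODE" },
        { "name": "NODEMANAGER" }
      ]
    }
  ]
}
```

A blueprint like this is registered with `POST /api/v1/blueprints/nls-poc` on the Ambari server, after which a cluster-creation request maps the two host groups onto the 1 NameNode + 2 DataNodes from the sizing section.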
Coming up: Apache Hadoop as NLS solution for SAP HANA Part 2
Thanks for sharing!
You are welcome, Linda!
Hi Shantanu, how are you storing SAP BW relational data in Hadoop? Are you using HBase to store the relational data in column-oriented data stores?
What criteria are you using to decide which warm and cold data to store in Hadoop?
Keen to know in the next parts too.
How is relational integrity maintained in Hadoop?
Hi Aniruddha,
Thank you for your question. It is important to understand that Hadoop is not a relational database; in essence, it is file system storage. The relational layer is provided by add-on products such as Apache Hive, Spark, etc. The HDFS files are linked to the BW ADSO through a Hive table (metadata) and a virtual view in HANA. I have updated the link to the second part in the blog above, which delves a bit further into this topic.
As for determining which data is hot vs. warm vs. cold, this is driven by the business. The idea is to segregate data based on reporting needs.
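To make the HDFS-to-HANA linkage concrete, the chain can be sketched in DDL. This is a hedged illustration, not the exact objects BW generates: the schema, table, and column names and the HDFS path are hypothetical, the Spark Controller host and credentials are placeholders, and the four-part identifier in `CREATE VIRTUAL TABLE` can vary by release (the adapter name `sparksql` and default port 7860 follow the SAP Spark Controller documentation).

```sql
-- Hive side: an external table whose metadata points at the archived
-- files in HDFS (names and path are illustrative)
CREATE EXTERNAL TABLE sapnls.zsales_adso_cold (
  doc_number STRING,
  calday     STRING,
  amount     DECIMAL(17,2)
)
STORED AS ORC
LOCATION '/sap/nls/zsales_adso_cold';

-- HANA side: a remote source pointing at the SAP HANA Spark Controller
CREATE REMOTE SOURCE "SPARK_SQL" ADAPTER "sparksql"
  CONFIGURATION 'port=7860;ssl_mode=disabled;server=<spark_controller_host>'
  WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hanaes;password=<password>';

-- HANA side: a virtual table over the Hive table, which BW queries
-- federate to transparently
CREATE VIRTUAL TABLE "SAPNLS"."VT_ZSALES_ADSO_COLD"
  AT "SPARK_SQL"."sparksql"."sapnls"."zsales_adso_cold";
```

A BW query against the ADSO then reads hot data from HANA tables and cold data through the virtual table, with the Spark Controller translating the federated request into a Spark SQL job over the Hive metadata.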
Hi,
I have added your guidance to the SAP NLS blog under Implementation - https://blogs.sap.com/2016/10/12/sap-nls-solution-sap-bw/#implementation
Thanks and Best Regards
Roland Kramer, PM SAP EDW, SAP SE
Thank you, Roland! Hope this helps others! I will be adding blog number 3 soon.
Thank you for sharing!
Nice blog! Would it be possible to set up the scenario for testing/demo purposes on a Cloudera-provided QuickStart VM, i.e. a pseudo-distributed single-node cluster? Or is an actual cluster installation necessary?
I am asking because I tried to install SAP Spark Controller on such a Cloudera QuickStart VM but was not successful - at least not with the standard SAP installation guide.
Thanks a lot!
Had to move this to LinkedIn since my account was deactivated.
https://www.linkedin.com/pulse/apache-hadoop-nls-solution-bw-sap-hana-shantanu-shaan-sardeshmukh/