
This is a three-part blog series covering the end-to-end integration scenario for SAP BW on HANA with Apache Hadoop as a near-line storage (NLS) solution.

For Part 2: Here

For Part 3: Coming Soon

Introduction:

Apache Hadoop has become the poster child for big data, largely due to its highly scalable analytics platform capable of processing large volumes of structured and unstructured data. SAP HANA, on the other hand, has gained ground as the leading in-memory analytics platform, letting you accelerate business processes and deliver quantifiable business intelligence at lightning speed. The two platforms are independent of each other, and their complementary strengths make them a strong fit for a long-term, sustainable, high-performance data lake strategy at any large multinational corporation.

This blog is intended as a walkthrough of implementing Apache Hadoop as a near-line storage solution for SAP HANA, leveraging the SAP HANA Spark Controller. For the sake of this blog, we will work with the following software versions and products:

SAP BW 7.5 SPS 5

SAP HANA 1.0 SPS 12 and higher

Core Apache Hadoop version 2.7.1 or higher (HDFS, MapReduce2, YARN)

Tez 0.7.0 as the execution engine for Hive (instead of MapReduce2, if preferred)

Spark 1.5.2 or higher

SAP HANA Spark Controller 2.0 SP01 Patch 1 or higher

SAP recommends these as the baseline requirements, but I have found through experience that these versions work very well with each other in terms of dependencies and interoperability. Both Hortonworks (HDP) and Cloudera (CDH) provide packaged Hadoop platforms that include the above versions. I have not personally worked with MapR, so I cannot say whether its distribution does as well, but there is most likely a combination available from them that works well with SAP. A quick way to confirm what is actually installed on a cluster node is sketched below.
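The following is a minimal Python sketch, not part of any SAP or Hadoop tooling, that prints the versions reported by the cluster's client tools. It assumes the hadoop, hive and spark-submit command-line clients are on the PATH of the node it runs on; adjust for your distribution's client layout.

```python
import subprocess

def first_line(cmd):
    """Run a version command and return the first line of its output."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return "not installed / not on PATH"
    # spark-submit prints its version banner to stderr rather than stdout
    text = (out.stdout or out.stderr).strip()
    return text.splitlines()[0] if text else "no output"

for cmd in (["hadoop", "version"],
            ["hive", "--version"],
            ["spark-submit", "--version"]):
    print(f"{cmd[0]:>14}: {first_line(cmd)}")
```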

 

Hadoop Cluster Architecture and Sizing:

If you are considering a proof of concept (POC), my recommendation would be to go with at least a three-node Hadoop cluster with one NameNode and two DataNodes. This will give the administration team a good feel for the production cluster in terms of setup and administrative duties. Apache Hadoop is a multi-component solution, and as such its fine-tuning and configuration aspects are fairly diverse yet interdependent. The overall architecture is depicted below at a high level.

 

Hadoop 3 Node Cluster: [architecture diagram]
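Once the cluster is up, a quick sanity check is whether the NameNode actually reports both DataNodes as live. Below is a minimal Python sketch, assuming the hdfs CLI and client configuration are available on the node it runs from (dfsadmin typically requires HDFS administrator privileges):

```python
import subprocess

# "hdfs dfsadmin -report" prints the NameNode's view of the cluster,
# including a "Live datanodes (N)" summary line.
report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                        capture_output=True, text=True).stdout

live = [line for line in report.splitlines()
        if line.startswith("Live datanodes")]
print(live[0] if live else "Could not parse report - is HDFS running?")
```

For the three-node layout above, the expected output is a line such as "Live datanodes (2):".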

SAP BW on HANA, Spark Controller and Hadoop: [architecture diagram]

Near Line Storage: [architecture diagram]

Please refer to the vendor documentation for Hadoop sizing. For the sake of reference, below is the cluster sizing that I used for the proof of concept; we went with a virtualized cluster for the POC.

[Cluster sizing table]

Hadoop Installation:

Depending on the flavor of Hadoop chosen for the POC, you can install the Hadoop cluster through Apache Ambari or through Cloudera Manager. The links for the detailed step-by-step installations are below, followed by a short post-install sanity check.

Apache Ambari

https://ambari.apache.org/1.2.1/installing-hadoop-using-ambari/content/ambari-chap1.html

Cloudera Manager

http://www.cloudera.com/documentation/cdh/5-1-x/CDH5-Installation-Guide/CDH5-Installation-Guide.html
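After the installation completes, one way to verify that all services came up is Ambari's REST API (Cloudera Manager offers an equivalent API). The Python sketch below is illustrative only: the host name, cluster name and credentials are placeholders to replace with your own.

```python
import base64
import json
import urllib.request

AMBARI = "http://ambari-host.example.com:8080"    # placeholder Ambari host
CLUSTER = "poc_cluster"                           # placeholder cluster name
AUTH = base64.b64encode(b"admin:admin").decode()  # Ambari's default credentials

# Ask Ambari for every service and its current state (STARTED, INSTALLED, ...)
req = urllib.request.Request(
    f"{AMBARI}/api/v1/clusters/{CLUSTER}/services?fields=ServiceInfo/state",
    headers={"Authorization": f"Basic {AUTH}",
             "X-Requested-By": "ambari"})  # header Ambari expects on API calls

with urllib.request.urlopen(req) as resp:
    for svc in json.load(resp)["items"]:
        info = svc["ServiceInfo"]
        print(f"{info['service_name']:>12}: {info['state']}")
```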

 

Coming up: Apache Hadoop as an NLS Solution for SAP HANA, Part 2


Comments


  1. Aniruddha Shinde

     

    Hi Shantanu, how are you storing SAP BW relational data in Hadoop? Are you using HBase to store that relational data in column-oriented data stores?

    What criteria are you using to decide which warm and cold data to store in Hadoop?

    Keen to know in the next parts too.

    How is relational integrity maintained in Hadoop?

    1. Shantanu Sardeshmukh Post author

      Hi Aniruddha,

      Thank you for your question. It is important to understand that Hadoop is not a relational database; it is, in essence, file system storage. The relational nature is achieved through add-on products such as Apache Hive, Spark, etc. The HDFS files are linked to the BW ADSO through a Hive table (metadata) and a virtual view in HANA. I have updated the link to the second part in the blog above, which delves a bit into this topic.

      Determining which data is hot vs. warm vs. cold is driven by the business; the idea is to segregate data based on reporting needs.
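      For illustration, the Hive side of that linkage looks roughly like the minimal sketch below. The host, table name, columns and HDFS path are hypothetical, and in the actual NLS scenario BW generates the Hive tables for the archived ADSO data; nobody creates them by hand.

```python
from pyhive import hive  # assumes the PyHive client package is installed

# Hypothetical DDL: registers Hive metadata over files already in HDFS.
# An EXTERNAL table only points at the files; it does not move or copy them.
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS nls_sales_archive (
    doc_number STRING,
    calmonth   STRING,
    amount     DECIMAL(17,2)
)
STORED AS ORC
LOCATION 'hdfs:///sap/nls/sales_archive'
"""

conn = hive.Connection(host="hive-server.example.com", port=10000)  # placeholder host
cur = conn.cursor()
cur.execute(DDL)  # creates metadata only; the HDFS files stay where they are
cur.execute("SHOW TABLES LIKE 'nls_sales_archive'")
print(cur.fetchall())
```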

       

