[SAP HANA Academy] SDA: SAP HANA Spark Controller Overview [SPS 10]
As part of the series profiling the new features for Smart Data Access for SAP HANA SPS10, the SAP HANA Academy’s Tahir Hussain (Bob) Babar walks through a series of chalkboards that provide a high level overview on how connecting SAP HANA to Hadoop has progressed over recent SAP HANA SPS releases. Bob also profiles how we now can connect to Hadoop using SAP HANA SPS10’s new SAP HANA Spark Controller. Check out Bob’s video below.
Also check out this SAP HANA Academy playlist to learn much more about SAP HANA Smart Data Access.
(1:30 – 5:50) Overview of Connecting Hadoop to SAP HANA in SAP HANA SPS07
Since SAP HANA SPS07 we have been able to connect SAP HANA to Hadoop using SAP HANA Smart Data Access. With SAP HANA installed on a Linux server we can join data between the two systems. Imagine that there is a schema with a bunch of tables in your SAP HANA system and you want to retrieve a very large amount of data that is stored in Hadoop in HDFS (Hadoop Distributed File System). There are a few different ways to access the data, including MapReduce and Spark. These engines process large data sets in parallel to retrieve the data. HiveQL is used to query the data in HDFS. SAP HANA Studio is the client used to access SAP HANA.
In SAP HANA SPS07 you were able to connect to a Hadoop system from SAP HANA using SAP HANA Studio. To accomplish this you used PuTTY or SSH to install various files (the unixODBC drivers and the Hive driver) on your SAP HANA Linux server. The Hive driver would connect to Hive on the Hadoop server, which would then ultimately go through MapReduce to reach the files in HDFS.
Then an end user using SAP HANA Studio could build a remote source and then a virtual table on the SAP HANA Linux server. That virtual table would connect through unixODBC and then through the Hive driver to Hive on the Hadoop system, which would run the MapReduce job. After this you were able to join data from SAP HANA and the Hadoop system with a single SQL statement.
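The three steps above (remote source, virtual table, joined query) can be sketched as the SQL a user might run from SAP HANA Studio. This is a minimal sketch, not the statements from the video: the source, schema, table and column names are hypothetical, and the "hiveodbc" adapter name and DSN-based configuration are assumptions to check against the Smart Data Access documentation for your release.

```python
def sps07_statements(source="HIVE_SRC", dsn="hive1", schema="DEMO",
                     virtual_table="V_ORDERS", hive_db="default",
                     hive_table="orders"):
    """Return the SQL statements in the order a user might run them.

    Every name here is hypothetical; the adapter name and the DSN
    configuration are assumptions, not taken from the video.
    """
    # 1. Remote source: points SAP HANA at the ODBC data source for Hive.
    create_source = (
        f'CREATE REMOTE SOURCE "{source}" ADAPTER "hiveodbc" '
        f"CONFIGURATION 'DSN={dsn}' "
        "WITH CREDENTIAL TYPE 'PASSWORD' "
        "USING 'user=<user>;password=<password>'"
    )
    # 2. Virtual table: a local name for one Hive table behind that source.
    create_virtual_table = (
        f'CREATE VIRTUAL TABLE "{schema}"."{virtual_table}" '
        f'AT "{source}"."<NULL>"."{hive_db}"."{hive_table}"'
    )
    # 3. One SQL statement joining a local table with the Hadoop data.
    join_query = (
        f'SELECT c."NAME", SUM(o."AMOUNT") AS "TOTAL" '
        f'FROM "{schema}"."CUSTOMERS" AS c '
        f'JOIN "{schema}"."{virtual_table}" AS o '
        f'ON o."CUSTOMER_ID" = c."ID" '
        f'GROUP BY c."NAME"'
    )
    return [create_source, create_virtual_table, join_query]


for statement in sps07_statements():
    print(statement + ";")
```

The point of the sketch is that once the virtual table exists, the join in step 3 looks like any other SQL query, even though one side of it is served by Hive on the Hadoop system.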
This worked but was very cumbersome. Also, with the SAP HANA Cloud Platform you no longer have ready access to the Linux server where all of these SAP HANA proxies reside.
(5:50 – 7:00) Overview of Connecting Hadoop to SAP HANA in SAP HANA SPS08
In the next release, SAP HANA SPS08, you were able to use Spark (a newer processing engine than MapReduce) instead of MapReduce. After a Spark driver was installed on the SAP HANA Linux server you could connect through Hive and use Spark to access the data in HDFS. Due to the technology advancements of Spark over MapReduce, the connectivity was much quicker. The connectivity path was also built with a nearly identical process in SAP HANA Studio.
(7:00 – 8:15) Overview of Connecting Hadoop to SAP HANA in SAP HANA SPS09
In SAP HANA SPS09 there was no need to install the unixODBC driver and the Spark/Hive driver, and no work had to be performed on the SAP HANA Linux server. Instead, a new kind of object, the MapReduce Archive File, is created with Java code in SAP HANA Studio and then deployed on the SAP HANA Linux server. The MapReduce Archive File then connects to MapReduce in the Hadoop system and ultimately to HDFS.
Another concept released in SPS09 was Virtual UDFs (User Defined Functions). With Virtual UDFs a user could connect directly to HDFS and bypass MapReduce. The user would create these objects directly in SAP HANA Studio.
(8:15 – 10:00) Overview of Connecting Hadoop to SAP HANA in SAP HANA SPS10
Now with SAP HANA SPS10 there is no need to deploy anything from SAP HANA Studio apart from creating the remote data source. All the work is performed on the Hadoop cluster. Essentially this new piece, the SAP HANA Spark Controller, is installed, configured and assembled directly on the Hadoop cluster. You then use YARN Shuffle and a Spark Assembly to connect SAP HANA to the HDFS system.
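The one step that remains on the SAP HANA side, creating the remote data source, might look roughly like the statement built below. Treat it as a sketch under assumptions: the "sparksql" adapter name, the configuration keys and the default port of 7860 are recalled from SAP's Spark Controller documentation rather than taken from the video, and the host, source name and credentials are placeholders.

```python
def spark_controller_remote_source(host, name="SPARK_SRC", port=7860,
                                   user="<user>", password="<password>"):
    """Build a CREATE REMOTE SOURCE statement for the Spark Controller.

    Assumptions to verify against the installation guide: the "sparksql"
    adapter name, the configuration keys, and the default port.
    """
    # The controller runs on the Hadoop cluster; SAP HANA only needs to
    # know where to reach it.
    config = f"port={port};ssl_mode=disabled;server={host}"
    return (
        f'CREATE REMOTE SOURCE "{name}" ADAPTER "sparksql" '
        f"CONFIGURATION '{config}' "
        "WITH CREDENTIAL TYPE 'PASSWORD' "
        f"USING 'user={user};password={password}'"
    )


# Hypothetical host name, for illustration only.
print(spark_controller_remote_source("hadoop-node.example.com") + ";")
```

Unlike the SPS07 setup, nothing in this statement refers to a driver or DSN installed on the SAP HANA Linux server.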
No work needs to be done on the SAP HANA Linux server because the Hadoop system is configured to use the SAP HANA Spark Controller to connect to the remote data source created in SAP HANA Studio. The SAP HANA Spark Controller uses the same route as before: it goes through Hive, then Spark, and finally connects to HDFS.
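To compare the four releases side by side, the connection paths described in the chalkboards can be written down as plain data. This is only a restatement of the overview above, not an SAP API; the hop labels are informal, and the SPS08 path assumes unixODBC was still in use because SPS09 is described as the release that removed it.

```python
# Connection path from SAP HANA to HDFS, per release, as described above.
PATHS = {
    "SPS07": ["virtual table", "unixODBC", "Hive driver", "Hive",
              "MapReduce", "HDFS"],
    "SPS08": ["virtual table", "unixODBC", "Spark driver", "Hive",
              "Spark", "HDFS"],
    "SPS09 (archive file)": ["MapReduce Archive File", "MapReduce", "HDFS"],
    "SPS09 (virtual UDF)": ["Virtual UDF", "HDFS"],
    "SPS10": ["remote source", "SAP HANA Spark Controller", "Hive",
              "Spark", "HDFS"],
}


def describe(release):
    """Render one release's path as a single arrow-separated line."""
    return " -> ".join(PATHS[release])


for release in PATHS:
    print(f"{release}: {describe(release)}")
```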
The next six videos in the What’s New with SAP HANA SPS10 playlist will cover how to install and configure the SAP HANA Spark Controller so you can run a single SQL statement based on data in both your SAP HANA and HDFS systems.
For over 75 tutorial videos on What’s New with SAP HANA SPS10 please check out this SAP HANA Academy playlist.
SAP HANA Academy – Over 1,200 free tutorial videos on SAP HANA, Analytics and the SAP HANA Cloud Platform.
Follow us on Twitter @saphanaacademy
Can you please provide us with the link to download the Spark Controller?
For testing and demo purposes we tried to install the SAP Spark Controller on the QuickStart VM 5.8.0 provided by Cloudera (a single-node cluster), but we were not successful with the standard installation guide.
Question: should the SAP Spark Controller also work in such a Cloudera QuickStart VM? Are there any steps necessary that differ from the standard installation guide?
Thanks a lot!