In the first part of the SAP HANA Academy’s four-part series on SAP Smart Data Access for Apache Spark, Tahir Hussain ‘Bob’ Babar provides an overview of connecting SAP HANA to Hadoop HDFS using Apache Spark and shows how to install Apache Spark on Windows.
Background Information on Smart Data Access (0:35 – 1:49)
Since the last few revisions, SAP HANA has been able to connect to disparate data sources with SAP Smart Data Access. This is accomplished by creating a virtual table in SAP HANA Studio that points to a table in the remote source, so the virtual table is essentially a view of the data held in the remote source. As a result, a single SQL query executed on the SAP HANA box can join data from, for example, a Microsoft SQL Server database with data from an SAP database.
In the previous SAP HANA Support Package Stack, SPS07, SAP HANA could connect directly to a Hadoop system using a driver from Simba. This federated cold data stored in Hadoop with warm data stored in SAP HANA, so the two data sets could be combined in a single SQL query.
Background Information on Hadoop, Map Reduce, and Apache Spark (1:49 – 5:50)
Apache Spark sits atop a Hadoop system and can improve on the performance of standard MapReduce jobs by a factor of 10 to 100.
Hadoop provides a distributed file system, the Hadoop Distributed File System (HDFS), that stores data on commodity machines and provides high aggregate bandwidth across a cluster. With a name server (the NameNode in Hadoop terminology), data and work can be distributed across many different machines.
Apache Hive sits atop HDFS and is a data warehouse infrastructure that provides data summarization, query, and analysis. Hive SQL is used to access data within an HDFS system.
Hadoop uses a framework called MapReduce to split a job into different parts. The parts can run in parallel, each generating a specific result. At the end, however, the result sets may need to be combined across parts and thus shuffled around. The reduce phase joins the resulting data sets together for the final result. MapReduce can be slow because it is designed for long-running batch processing applications, and it executes jobs in a simple but very rigid structure.
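The map, shuffle, and reduce phases described above can be sketched in a few lines of plain Python. This is only an illustration of the idea on a word-count example, not Hadoop code; the function names and the sample input are made up for this sketch, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Shuffle: group the values for each key across all mapper outputs."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into one final result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different machine;
    # here the "parallel" parts are simply processed one after another.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["hana spark hadoop", "spark hive spark"]))
```

In this toy version the shuffle is just a dictionary update. On a cluster it means moving data between machines, which is where much of the cost comes from.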
A problem occurs with complex multi-stage applications that string together many high-latency MapReduce jobs and execute them in sequence. Moving the data from one MapReduce job to the next hurts performance.
Apache Spark is a general-purpose engine that is regarded as the successor to MapReduce. It is designed to run many more workloads in parallel than MapReduce. Spark creates execution plans for complex multi-step directed acyclic graphs (DAGs) and executes a whole DAG at once rather than one job at a time like MapReduce. This cuts down the shuffling of data between separate jobs and supports in-memory data sharing across DAGs, so different jobs can work with the same data at very high speed.
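To make the idea of planning first and executing later more concrete, here is a toy sketch in plain Python. It is not the Spark API: the LazyDataset class and its methods are hypothetical names invented for this illustration. It records a chain of steps (the simplest kind of DAG), runs the whole chain in one pass only when a result is requested, and keeps that result in memory for later use.

```python
class LazyDataset:
    """Toy stand-in for a Spark-style dataset (hypothetical, not Spark code)."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)
        self._cached = None  # filled in on the first collect()

    def map(self, fn):
        # Only record the step; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        # Only record the step; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def plan(self):
        """Return the names of the recorded steps, in execution order."""
        return [kind for kind, _ in self._steps]

    def collect(self):
        """Run every recorded step in one pass and keep the result in memory."""
        if self._cached is None:
            rows = iter(self._source)
            for kind, fn in self._steps:
                rows = map(fn, rows) if kind == "map" else filter(fn, rows)
            self._cached = list(rows)
        return self._cached


if __name__ == "__main__":
    squares = LazyDataset(range(10)).map(lambda x: x * x)
    even_squares = squares.filter(lambda x: x % 2 == 0)
    print(even_squares.plan())
    print(even_squares.collect())
```

No intermediate result is written out between the map and filter steps; the rows stream through both steps in memory, which is the contrast with a chain of separate MapReduce jobs.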
Prerequisites for Installing Apache Spark on Windows (5:50 – 9:28)
To install Apache Spark, you must have a running Hadoop 2.2 system. If you do not, please view this video on installing a Hadoop system from the SAP HANA Academy’s Smart Data Access Provisioning series. Also familiarize yourself with the following three videos to learn how to use Hadoop: Loading Data into Hadoop, Configuring the ODBC Drivers, and Using the Remote Data Source.
How to Startup Hadoop and Hive (9:28 – 13:55)
First start your Hadoop system and, once all of the servers are up, open the Hadoop Command Line. After you log into your Hadoop machine, a new window opens in which you can run Hive commands.
Now connect to Hive by going to the C drive and into the hdp folder where Hive is stored. Within the Hive folder, go to the bin folder and type the command hive to launch it.
Now type show databases; to list the databases. In this example Bob has access to two databases. Bob next types show tables in live2; and sees the four tables in his live2 database. Entering select * from live2.connections; shows the data from the connections table, a simple two-column table with around 53,000 rows.
Installing Apache Spark on Windows (13:55 – 19:47)
In a web browser go to saphana.com/community/spark to see instructions on how to install Apache Spark on a Linux machine. Bob will show how to install Spark on a Windows machine. Users will need SAP HANA SPS07 or later.
From the instructions page click on the link for Apache Spark 1.0.1. Enter your personal details and agree to the software license before clicking submit. Now you will have a link to the various Spark tools. Click on the link for Apache Spark 1.0.1 to download the file to your local machine.
First extract Spark wherever you would like to install it. Bob installs Spark on his desktop but you may want to install it on the root drive of your machine.
Wherever you have installed the Hadoop system, go to the hdp folder, then the hive folder, then the conf folder, and open the hive-site.xml file in Notepad. The file lists the ports used and the various settings for the Hadoop Hive system. Now copy the hive-site.xml file and paste it into the conf folder of Apache Spark.
Next go into the bin directory of your Hive folder in hdp and copy the beeline.cmd file. Then paste the beeline.cmd file into the bin folder of the Spark directory. Beeline enables the running of SQL against the Hive data via JDBC.
Now Apache Spark is fully installed on your Windows machine.
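If you prefer to script the two copy steps rather than use Windows Explorer, a minimal Python sketch could look like the following. The function name is made up for this example, and the Hive and Spark folder locations are passed in as parameters because the exact paths depend on where you installed Hadoop and extracted Spark.

```python
import shutil
from pathlib import Path


def copy_hive_files_to_spark(hive_home, spark_home):
    """Copy hive-site.xml and beeline.cmd from the Hive folder into Spark.

    hive_home and spark_home are assumptions supplied by the caller: the
    Hive folder inside hdp and the folder where Spark was extracted.
    """
    hive_home, spark_home = Path(hive_home), Path(spark_home)
    copies = [
        (hive_home / "conf" / "hive-site.xml", spark_home / "conf" / "hive-site.xml"),
        (hive_home / "bin" / "beeline.cmd", spark_home / "bin" / "beeline.cmd"),
    ]

    # Check both source files first so nothing is copied if one is missing.
    for source, _ in copies:
        if not source.is_file():
            raise FileNotFoundError(f"Expected file not found: {source}")

    for source, target in copies:
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)

    return [target for _, target in copies]


# Example call with placeholder paths; replace them with your own folders:
# copy_hive_files_to_spark(r"C:\hdp\<your hive folder>", r"C:\<your spark folder>")
```

The function returns the two target paths so you can confirm where the files ended up.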
Check out Bob’s video on how Apache Spark works and how to install it on a Windows machine.
SAP HANA Academy – over 500 free tutorial technical videos on using SAP HANA.
SAP HANA Academy