Part 1 of this series focused on how the new SAP HANA Data Lifecycle Manager features helped to utilize automated processes to tier the data from HANA to Hadoop. Now that the data is in Hadoop we will look at how to start analyzing this data with SAP HANA Vora. Before we get too far let me give you a quick high-level overview of SAP HANA Vora….
“For digital businesses that want to infuse business decisions with valuable context from new data sources, SAP HANA Vora is an in-memory query engine that plugs into the Apache Spark execution framework to provide enriched inter- active analytics on data stored in Hadoop. It lets you combine Big Data with corporate data in a way that is both simple and fast. “
Vora is a key component in the SAP Big Data Strategy for the digital enterprise. As part of the data platform Vora helps to bridge the digital divide for Analysts, Developers, DBAs, and Data Scientists. This blog will focus on the Data Scientists.
Vora will Transform Big Data analysis with an in-memory computing engine for Hadoop.
Correlate Hadoop and SAP HANA data for instant insight that drives contextually-aware decisions.
- Organize massive volumes of unstructured data into data hierarchies
- Make OLAP modeling on Hadoop data accessible to everyday users with popular SQL language
- Simplify Big Data management with data federation for SAP HANA and Hadoop landscapes
Now that you have an idea of what SAP HANA Vora is, let’s dive right in. We are going to look at how a data scientist could utilize the data that was tiered from HANA–>Hadoop.
In Part 1 we specified the DLM destination as “Spark destination”. In order for us to be able to work with the data we first need to understand a few things on how and where the information is stored in Hadoop. The destination parameter is specified for the spark controller in the hanaes-site.xml file. In the below screen shot you can see for this system the destination in HDFS is /sap/hana/hanaes/warehouse.
In part 1 we actually ran the DLM job. As a result if we look at the target HDFS destination we can see the following files that have been created by DLM.
The thing to note here is DLM creates Parquet files when relocating data to Hadoop. What is Parquet? In short …
“Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”(More information on parquet here). This is a perfect match for Hana as Parquet is a columnar storage format.
There are a few options to work with Vora. Zeppelin, the Spark shell, or Beeline. At the risk of this blog getting to long, I’ll try to show an example of all 3.
What is Zeppelin?
“A web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.” (More information here)
Configuring Zeppelin to use the Vora interpreter takes some setup (I’ll save this for another blog if anyone is interested). For the purpose of this blog I’ll assume you have zeppelin configured and working with the Vora interpreter.
1 – Open Zeppelin in your browser
2 – Create a new note and type in your create table statement. The Vora interpreter is utilized by specifying %vora.
3 – Now that the table has been created, SQL statements can be executed against it. If you remember part 1 of the blog we relocated 100,740 records from Hana to Hadoop. We can do a count (*) on the vora table to validate all the relocated records are accounted for.
4 – At this point you are free to run any query against the Vora table. I have done a simple sum and group by as seen below.
If you don’t need to visualize what you are working on in a flashy way then you can simply use the spark shell to run and review your results.
1 – Start the spark shell. You can find it in the Vora bin directory.
2 – Import org.apache.spark.sql
3 – Create the Vora table. The syntax is the same as what we used for Zeppelin
4 – Similar to zeppelin we can run the sum select using the standard SQL syntax.
- sqlContext.sql( s”””
,SUM(cast (SMS_IN as double))as SMS_IN
,SUM(cast (SMS_OUT as double))as SMS_OUT
,SUM(cast (CALL_IN as double))as CALL_IN
,SUM(cast (CALL_OUT as double))as CALL_OUT
,SUM(cast (INTERNET_TRAFFIC as double)) as INTERNET_TRAFFIC
order by CALL_IN desc
We use Beeline, a JDBC client, to register tables created in SAP HANA Vora in the Thrift server . This method will allow SAP Lumira to connect to the Vora tables.
1 – Start beeline. This is located in the /vora/bin directory.
2 – Submit the create table statement; again this syntax is exactly the same as we used in the previous 2 examples.
3 – In order to connect with Lumira, the thrift server needs to be started. This will enable a connection from Lumira to the Vora table. The thrift server is located in the vora bin directory.
4 – Next we move to Lumira. Create a newdataset. The dataset connection to Vora is done using the Generic JDBC Drivers as seen below.
5 – Once the JDBC data source is selected we need to set the JDBC URL and the JDBC Class. The class that we utilize will be com.simba.spark.jdbc4.Driver
6 – Once we have successfully connected to the Thrift Server via the JDBC connection a list of the Vora tables should be available to query. See below the cdr_detail Vora table that we created in beeline.
7 –Standard Lumira visualizations and analytics can now be created against the Vora table.
Conclusion: This is our first look at analyzing the Hadoop data with Vora. You should now have a sense as to how Vora is bridging the digital divide for Data Scientists.
Next Steps: In part 3 of this blog series we will update our code to show how Vora is utilized to combine both Hot data in HANA and the cold data in Hadoop. The next part will really showcase how Vora can be utilized to democratize data access in the enterprise. Following that we will take a look at how we democratize and bridge the gap from the perspective of a data analyst.