Smart Data Access with HADOOP HIVE & IMPALA
“SAP HANA smart data access enables remote data to be accessed as if they are local tables in SAP HANA, without copying the data into SAP HANA. Not only does this capability provide operational and cost benefits, but most importantly it supports the development and deployment of the next generation of analytical applications which require the ability to access, synthesize and integrate data from multiple systems in real-time regardless of where the data is located or what systems are generating it.”
Reference: http://help.sap.com/hana/Whats_New_SAP_HANA_Platform_Release_Notes_en.pdf Section 2.4.2
Currently Supported databases by SAP HANA smart data access include:
- Teradata Database: version 13.0SAP
- Sybase IQ: version 15.4 ESD#3 and 16.0
- SAP Sybase Adaptive Service Enterprise: version 15.7 ESD#4
- Intel Distribution for Apache Hadoop: version 2.3 (This includes Apache Hadoop version 1.0.3 and Apache Hive 0.9.0.)
Also Refer to:
SAP Note 1868209: Additional information about SPS06 and smart data access
SAP Note 1868702: Information about installing the drivers that SAP HANA smart data access supports
UPDATE (Dec 04 2013): As of SPS07 Hortonworks HDP1.3 (When’s HDP 2.0 coming?) appears to have been added to the official list, and remote caching of HADOOP Sources has been added, which should hopefully speed queries up for those tables in HADOOP that aren’t changing frequently.
UPDATE (Jan 29 2014): SAP HANA Academy now has a great collection of videos using Smart Data Access. Thanks SAP 🙂
Using Smart Data Access (SDA) with HADOOP seems to me a great idea for balancing the strengths of both tools. Unfortunately for real-time responsiveness HIVE SQL currently isn’t the most optimal tool in HADOOP [instead it’s better used for batched SQL commands]. Cloudera’s Impala, Hortonworks Stinger initiative and MapR’s Drill are all trying to address real-time reporting.
I’ve only tested Impala so far, but I’ve noticed speeds of 10 to 100 times improvement over standard HIVE SQL queries. With that in mind I thought it would be interesting to test them both in HANA using SDA.
Unfortunately I’m using Cloudera’s open-source Apache Hadoop distribution (CDH), which isn’t on SAP’s approved list yet. However since SDA uses ODBC I’ve managed to get it working using a third party ODBC driver from Progress|DataDirect. http://www.datadirect.com/products/odbc/index.html
NOTE: Since CDH is not currently on this list I’m sure SAP will NOT recommend you using this in a production environment. If you do though get it working in a sandbox environment why not help by adding your voice for it be certified and added to the ‘official’ list.
With the disclaimers out of the way this is how SDA works.
Remote Data Sources:
Once you have your ODBC drivers install properly Remote Sources can be added for both HIVE and IMPALA
Expanding the Remote Sources shows the tables that can be access by HANA.
NOTE: For me expanding the HIVE1 tree takes almost 20 seconds each time expanding a node (perhaps it uses mapreduce?), IMPALA1 nodes in the hierarchy expanded quickly.
In the above screen shots you will notice that both HIVE1 & IMPALA1 share the same tables as they use the same HADOOP metastore. Data is NOT replicated in HIVE tables and IMPALA tables. The metastore just points to the tables files location within the HADOOP ecosystem, whether stored as text files, HBASE tables or column store PARQUET files (to list just a few).
There are some tables types (file types) that can only be read by HIVE or IMPALA, but there is a large overlap and this may converge over time.
Select Create virtual tables, from your Remote Source, in the schema of your choice.
NOTE: I’ve previously created an ‘HADOOP’ schema in HANA to store these virtual tables.
Once created you can open the definition of the new virtual tables, as per normal HANA tables.
Run some queries:
Simple HIVE query on my extremely small and low powered HADOOP cluster (23 Seconds)
NOTE: In the HADOOP system, you can see above the HIVE’s map reduce is kicked off
Simple IMPALA query on my extremely small and low powered HADOOP cluster (reading the SAME table as HIVE) (< 1 Second)
NOTE: Impala does not use MAP/REDUCE
With Impala the source table type may impact speeds as well as these 2 simple examples demonstrate.
IMPALA HBASE table (40K records in 4 seconds) :
IMPALA PARQUET Column Store (60 Million Records in 3 Seconds)
HADOOP HBASE source tables are better for small writes and updates, but are slower at reporting.
HADOOP IMPALA PARQUET tables use Column store logic (similar to HANA column tables) which need which take more effort to write too efficiently, but are much faster at reads (assuming not all the fields in a row are return, not that dis-similar to HANA Column tables as well).
You can think of Parquet tables, like the part of the HANA column table after MERGE DELTA, whereas the HBASE table is more like the uncompressed part of a HANA column table PRIOR to MERGE DELTA.
HADOOP tables are still stored on Disk (using HDFS) rather than in memory, however they are making progress in caching tables into memory on the nodes, to better improve performance of queries.
SQL for creating HADOOP Remote Source:
Unfortunately Hadoop remote source can’t be manually configured yet. They do not appear in the drop down:
Since the HADOOP adapter doesn’t appear in the list, use the HANA SQL editor to create the HADOOP Remote Sources:
DROP REMOTE SOURCE HIVE1 CASCADE;
DROP REMOTE SOURCE IMPALA1 CASCADE;
CREATE REMOTE SOURCE HIVE1 ADAPTER “hiveodbc” CONFIGURATION ‘DSN=hwp’
WITH CREDENTIAL TYPE ‘PASSWORD’ USING ‘user=hive;password=hive’;
CREATE REMOTE SOURCE IMPALA1 ADAPTER “hiveodbc” CONFIGURATION ‘DSN=iwp’
WITH CREDENTIAL TYPE ‘PASSWORD’ USING ‘user=hive;password=hive’;
CDH Driver Installation:
Unfortunately Cloudera doesn’t yet provide ODBC drivers for SAP.
I tried some of their other ODBC drivers for Micro Strategy without success.
Fortunately a third party, Progress | Data direct supplies ODBC drivers for HIVE and IMPALA running on CDH.
Dowload their 15 day trial and follow their steps for compiling it for HANA in Linux:
tar -xf PROGRESS_DATADIRECT_CONNECT64_ODBC_7.1.2_LINUX_64.tar
In the $HOME directory of your ‘hdbadm‘ user you need to add odbc settings.
Create 2 files:
.customer.sh which adds the location of your new driver to the library path
.odbc.ini which define the ODBC DSN connections used need when creating a Remote Source
My 2 files appear as follows:
[ODBC Data Sources]
iwp=DataDirect 7.1 Impala Wire Protocol
hwp=DataDirect 7.1 Apache Hive Wire Protocol
Description=DataDirect 7.1 Impala Wire Protocol
HostName=[Put the IP address of your HIVE gateway here]
Description=DataDirect 7.1 Apache Hive Wire Protocol
HostName=[Put the IP address of your main IMPALA node here]
UPDATE (Dec 04 2013): Here is an example of the new SPS07 remote caching syntax:
Select * from hive_activity_log where incident_type = ‘ERROR’ and plant =’001’ with hint (USE_REMOTE_CACHE)