SAP HANA Smart Data Access (3)—How to access Hadoop data through Hive with SDA
A Chinese version of this blog (SAP HANA Smart Data Access（三）——如何利用SDA通过Hive访问Hadoop数据) is also available.
In the previous blog of this series, we talked about how to install and configure SDA data sources on the SAP HANA server side. As most data sources supported by SAP HANA are databases, the installation and configuration procedure is similar for all of them. But for the Hadoop data source, things are different. As a distributed data processing platform, Hadoop usually stores data in the HDFS file system, or in the NoSQL database HBase, which is itself typically built on HDFS. However, neither HDFS nor HBase supports the ODBC protocol. So we need another member of the Hadoop family to solve this problem: Hive. Hive implements a SQL interface for HDFS and HBase, and a HiveODBC driver is also provided. In this blog, we'll talk about how SDA accesses Hadoop data through Hive.
Deploy Hadoop and Hive
The official Hadoop version supported by SAP HANA SDA is "Intel Distribution for Apache Hadoop version 2.3" (including Apache Hadoop 1.0.3 and Apache Hive 0.9.0). Although there is only one version on the officially supported list, the experiment in this blog shows that SDA can also access data stored in the ordinary Apache distribution of Hadoop. For this experiment we built a Hadoop cluster of 3 nodes, running Apache Hadoop 1.1.1 and Apache Hive 0.12.0.
As guides for deploying Hadoop and Hive can easily be found on the internet, we won't discuss that here. After deploying Hadoop and Hive, some data for the experiment needs to be prepared. Here we use a user information table with the following structure:
Data can be imported from a CSV file into a Hive table. First, create the table in the hive shell:
create table users(USERID string, GENDER string, AGE int, PROFESSION string, SALARY int)
row format delimited
fields terminated by '\t';
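If you need sample data to load, a tab-separated file matching this schema can be generated with a short script. This is just a sketch: the output path /tmp/users.csv and the generated values are arbitrary.

```shell
# Generate 1000 sample rows for the users table (illustrative values only).
# Fields are tab-separated to match the "fields terminated by '\t'" clause.
awk 'BEGIN {
  OFS = "\t";
  split("engineer teacher doctor", prof, " ");
  for (i = 1; i <= 1000; i++)
    print "u" i, (i % 2 ? "M" : "F"), 20 + i % 40, prof[i % 3 + 1], 3000 + (i % 50) * 100;
}' > /tmp/users.csv
```

The resulting file can then be used as the input of the load statement that follows.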
Then, import the data from the CSV file into the users table:
load data local inpath '/input/file/path'
overwrite into table users;
Here the data is imported from the local file system; Hive can also load data from HDFS. In this experiment, the users table contains 1,000,000 records. After importing, count the records:
As shown in the picture above, Hive runs a MapReduce job to answer the query, and it takes 14.6 seconds to count the records of the users table. Next, select the top 10 records of the table:
As we can see, it takes only 0.1 second.
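For reference, the two statements run in the hive shell are simply:

```sql
-- Full-table count: Hive compiles this into a MapReduce job (~14.6 s here).
select count(*) from users;

-- Top 10 rows: a plain fetch that needs no MapReduce, hence the ~0.1 s.
select * from users limit 10;
```

The difference in timing illustrates that only queries requiring aggregation trigger a MapReduce job.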
Installing and configuring HiveODBC Driver
As with installing drivers for other data sources, installing the HiveODBC driver requires unixODBC on the SAP HANA server side. HiveODBC requires unixODBC 2.3.1 or a newer version. For more details about installing unixODBC, please see the reference.
With unixODBC installed, we can install the HiveODBC driver. As introduced in the reference, we use the HiveODBC driver provided by Simba Technologies. The installation procedure is as follows:
- Download the Simba HiveODBC driver package and decompress it to a directory. Then enter /<DRIVER_INSTALL_DIR>/simba/hiveodbc/lib/64 (use 32 instead of 64 on a 32-bit system) to check the driver file libsimbahiveodbc64.so.
- Log in to the SAP HANA server as sidadm.
- Execute "HDB stop" to stop SAP HANA.
- Copy the file “/<DRIVER_INSTALL_DIR>/simba/hiveodbc/Setup/simba.hiveodbc.ini” to the home directory of sidadm.
- Edit ~/.simba.hiveodbc.ini with vim.
- If there is a line "DriverManagerEncoding=UTF-32", change it to "DriverManagerEncoding=UTF-16".
- Check that ErrorMessagePath points to /<DRIVER_INSTALL_DIR>/simba/hiveodbc/ErrorMessages, and correct it if it doesn't.
- Comment out the line ODBCInstLib=libiodbcinst.so and add a new line ODBCInstLib=libodbcinst.so.
- Edit the .odbc.ini file in the home directory of sidadm and add a new DSN for Hive; the default Hive port is 10000. Here's an example:
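A minimal .odbc.ini entry might look like the following. The DSN name hive1 is the one checked with isql later; the driver path and host are placeholders to adapt to your installation.

```ini
[hive1]
Driver=/<DRIVER_INSTALL_DIR>/simba/hiveodbc/lib/64/libsimbahiveodbc64.so
Host=<hive_server_host>
Port=10000
```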
- Edit the file $HOME/.customer.sh to set some environment variables:
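A sketch of the environment variables typically needed; the library paths below are illustrative, so point them at your actual unixODBC and Simba HiveODBC install locations.

```shell
# Illustrative paths; adjust to your unixODBC and Simba HiveODBC locations.
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/unixODBC-2.3.1/lib:/opt/simba/hiveodbc/lib/64"
# Tell the ODBC driver manager where the DSN definitions live.
export ODBCINI="$HOME/.odbc.ini"
```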
- Use isql to check whether the SAP HANA server can connect to the remote data source successfully:
isql -v hive1
- If the connection succeeds, execute "HDB start" to start SAP HANA.
Create hive data source
Once HiveODBC is installed and configured, create the Hive data source in SAP HANA Studio following the steps introduced in the reference. Here, you need to choose HIVEODBC as the adapter.
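If you prefer SQL to the Studio wizard, a remote source can also be created with a statement along these lines. The source name HIVE_SRC and the credentials are hypothetical; hive1 is the DSN defined in .odbc.ini.

```sql
-- "HIVE_SRC" and the credentials are placeholders; "hive1" is the DSN
-- defined in .odbc.ini on the SAP HANA server.
CREATE REMOTE SOURCE HIVE_SRC ADAPTER "hiveodbc"
  CONFIGURATION 'DSN=hive1'
  WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=<hive_user>;password=<password>';
```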
After the Hive data source is created, you can view the tables in Hive, as shown in the picture below:
Query Hive virtual table
Add a new virtual table mapping to the users table in Hive, following the steps introduced in the reference. Then count the records of the virtual table:
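Creating the virtual table and running the count can also be done in SQL. The names below are hypothetical, and the four-part remote table identifier may differ by driver, so check the path SAP HANA Studio shows for your source.

```sql
-- "HIVE_USERS" and the target schema are placeholders; the remote path
-- points at the users table in Hive's default database.
CREATE VIRTUAL TABLE "SYSTEM"."HIVE_USERS"
  AT "HIVE_SRC"."HIVE"."default"."users";

SELECT COUNT(*) FROM "SYSTEM"."HIVE_USERS";
```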
As shown above, it takes 14.1 seconds to count the virtual table from SAP HANA Studio, which is close to the time the same count took on the Hive side. This tells us that SAP HANA SDA barely affects the performance of operations in the remote data source when little data transmission is involved.
In this blog, we used a simple example to illustrate how SAP HANA SDA accesses a Hive table stored in Hadoop. Hive is a tool that provides a SQL interface for Hadoop. The experiment shows that querying the virtual table from SAP HANA Studio and querying the Hive table directly on the Hive side are very close in performance when little data transmission is involved.