We have a Chinese version(SAP HANA Smart Data Access(三)——如何利用SDA通过Hive访问Hadoop数据) of this blog.

Introduction

      In previous blog of this series, we talked about how to install and configure the data source of SDA in SAP HANA Server side. As most data sources supported by SAP HANA are all database, the procedure of installation and configuration is similar. But for Hadoop data source, something is different. As a distributed data processing platform, Hadoop usually store data in HDFS file system, or in NoSQL database HBase which is also usually based on HDFS. However, both of HDFS and HBase don’t support ODBC protocol. So we need another member of Hadoop family  to solve this problem, it is Hive. Hive implements SQL interface for HDFS and HBase, and HiveODBC driver is also provided. In this blog, we’ll talk about how does SDA access the Hadoop data through Hive.

Deploy Hadoop and Hive

      The Official version of Hadoop supported by SAP HANA SDA is “Intel Distribution for Apache Hadoop version 2.3” (Including Apache Hadoop version 1.0.3 and Apache Hive 0.9.0). Although there’s only one version in the official supported version list, the experiment of this blog shows that SDA can also access the data stored in ordinary Apache version of Hadoop. The experiment of this blog build up a Hadoop cluster containing 3 nodes, and the version of Hadoop and Hive is : Apache Hadoop 1.1.1 and Apache Hive 0.12.0.

      As the guide of deploying Hadoop and Hive can be easily found in internet, we don’t discuss it here. After deploying Hadoop and Hive, some data for experiment needs to be prepared. Here, We use a user information table, the structure of the table is :

Column Name

Data Type

USERID

VARCHAR(20)

GENDER

VARCHAR(6)

AGE

INTEGER

PROFESSION

VARCHAR(20)

SALARY

INTEGER

      Data can be imported from csv file into hive table. Firstly, create a table using hive shell.

create table users(USERID string, GENDER string, AGE int, PROFESSION string, SALARY int)

row format delimited

fields terminated by ‘\t’;

      Then, import data from csv file to the users table:

load data local inpath ‘/input/file/path’

overwrite into table users;

Here, the data is imported from local file system, Hive can also import data from HDFS. In this experiment, the number of records in users table is 1,000,000. After importing, count the record number:

/wp-content/uploads/2014/06/1_464921.png

As shown in the picture above, Hive call the MapReduce to query data, and it takes 14.6 seconds to count the record number of users table. Afterwards, select out the top 10 records of the table:

/wp-content/uploads/2014/06/2_464922.png

As we see, it takes 0.1 second.

Installing and configuring HiveODBC Driver

      Same as installing driver for other data sources, installing HiveODBC driver also requires unixODBC installed in the SAP HANA Server side. HiveODBC requires unixODBC-2.3.1 or newer version. For more details about installing unixODBX, please see the reference [2].

      unixODBC installed, Begin to install HiveODBC driver. As introduced in reference [2], we use the HiveODBC provided by Simba Technologies. The procedure of installing is like below:

  1. Download Simba HiveODBC driver package, and decompress the package to a certain directory. Then enter the directory: /<DRIVER_INSTALL_DIR>/samba/hiveodbc/lib/64 (use 32 to replace 64 if it’s for 32-bit system) to check the driver file libsimbahiveodbc64.so.
  2. Login SAP HANA Server as sidadm.
  3. Execute “HDB stop” to stop the SAP HANA.
  4. Copy the file “/<DRIVER_INSTALL_DIR>/simba/hiveodbc/Setup/simba.hiveodbc.ini” to the home directory of sidadm.
  5. Edit the ~/.simba.hiveodbc.ini with vim.
  6. If there’s one row “DriverManagerEncoding=UTF-32”, change it to UTF-16.
  7. Check the ErrorMessagePath = /<DRIVER_INSTALL_DIR>/simba/hiveodbc/ErrorMessages, correct it if it doesn’t points to right path.
  8. Comment out the row: ODBCInstLib=libiodbcint.so, and add a new row: ODBCInstLib=libodbcinst.so.
  9. Edit the .odbc.ini file in home directory of sidadm, add a new DSN for hive, the default port for hive is 10000, here’s an example:

[hive1]

Driver=/<DRIVER_INSTALL_DIR>/simba/hiveodbc/lib/64/libsimbahiveodbc64.so

Host=<IP>

Port=10000

   10. Edit the file $HOME/.customer.sh to set some environment variable:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:

/< DRIVER_INSTALL_DIR>/simba/hiveodbc/lib/64/

export ODBCINI=$HOME/.odbc.ini

   11. Use isql to check whether SAP HANA Server can connect to remote data source successfully:

isql –v hive1

   12. If connect successfully, execute “HDB start” to start SAP HANA.

Create hive data source

      When installing and configuring HiveODBC finished, create the hive data source in SAP HANA Studio following the steps introduced in Reference [1]. Here, you need to choose the HIVEODBC as the adapter.

      After hive data source created, you can view the tables in hive, as shown in picture below:

/wp-content/uploads/2014/06/3_464923.png

Query Hive virtual table

      Add a new virtual table which maps to table users in Hive following the steps introduced in Reference [1]. Then count the record number of the virtual table:

/wp-content/uploads/2014/06/4_464924.png

As shown above, it takes 14.1 seconds to count the virtual table in SAP HANA Studio, which is close to the time it cost in Hive side. The result tells us that SAP HANA SDA doesn’t influence the performance of operation in remote data source when little data transmission involved.

Conclusion

      In this blog, we illustrate how SAP HANA SDA access the Hive table stored in Hadoop using a simple example. Hive is a tool which provide SQL interface for Hadoop. From the experiment result, querying the virtual table in SAP HANA Studio and querying the Hive table in Hive side is very close in performance when little data transmission involved.

Reference

  1. SAP HANA Smart Data Access1)——A brief introduction to SDA
  2. SAP HANA Smart Data Access2)——How to install and configure the data source driver of SDA
  3. Section 6.1.1 of SAP HANA Administrator Guide: http://help.sap.com/hana/SAP_HANA_Administration_Guide_en.pdf
To report this post you need to login first.

2 Comments

You must be Logged on to comment or reply to a post.

  1. Ram Lokeswara

    Hello Leo,

    We are using SAP HANA SPS 10, connecting Hadoop using spark controller.

    We are able to connect to Hadoop system successfully. But after creating virtual table, we are unable to see the content. we are getting the below error.

    Could not execute ‘SELECT TOP 1000 * FROM “HDPUSER”.”spark_demo_products”‘ SAP DBTech JDBC: [403]: internal error: Error opening the cursor for the remote database for query “SELECT “spark_demo_products”.”productid”,

    SAP HANA SDA.png

    Could you please let us know what is that we are missing to access the virtual tables.

    Thank you.

    Best regards,

    Ram

    (0) 
  2. Joan Sánchez

    Hello Leo,

    I can to connect to my Hadoop system successfully,

    Is possible to do an insert query in a Hive virtual table? I can to do a select query,

    But cannot to do insert or update query,

    Could you tell me why you think this is happening?

    Thank you very much in advance.

    Best regards,

    JoanSE

    (0) 

Leave a Reply