We have a Chinese version(SAP HANA Smart Data Access(三)——如何利用SDA通过Hive访问Hadoop数据) of this blog.
In previous blog of this series, we talked about how to install and configure the data source of SDA in SAP HANA Server side. As most data sources supported by SAP HANA are all database, the procedure of installation and configuration is similar. But for Hadoop data source, something is different. As a distributed data processing platform, Hadoop usually store data in HDFS file system, or in NoSQL database HBase which is also usually based on HDFS. However, both of HDFS and HBase don’t support ODBC protocol. So we need another member of Hadoop family to solve this problem, it is Hive. Hive implements SQL interface for HDFS and HBase, and HiveODBC driver is also provided. In this blog, we’ll talk about how does SDA access the Hadoop data through Hive.
The Official version of Hadoop supported by SAP HANA SDA is “Intel Distribution for Apache Hadoop version 2.3” (Including Apache Hadoop version 1.0.3 and Apache Hive 0.9.0). Although there’s only one version in the official supported version list, the experiment of this blog shows that SDA can also access the data stored in ordinary Apache version of Hadoop. The experiment of this blog build up a Hadoop cluster containing 3 nodes, and the version of Hadoop and Hive is : Apache Hadoop 1.1.1 and Apache Hive 0.12.0.
As the guide of deploying Hadoop and Hive can be easily found in internet, we don’t discuss it here. After deploying Hadoop and Hive, some data for experiment needs to be prepared. Here, We use a user information table, the structure of the table is :
Column Name | Data Type |
USERID | VARCHAR(20) |
GENDER | VARCHAR(6) |
AGE | INTEGER |
PROFESSION | VARCHAR(20) |
SALARY | INTEGER |
Data can be imported from csv file into hive table. Firstly, create a table using hive shell.
create table users(USERID string, GENDER string, AGE int, PROFESSION string, SALARY int)
row format delimited
fields terminated by '\t';
Then, import data from csv file to the users table:
load data local inpath '/input/file/path'
overwrite into table users;
Here, the data is imported from local file system, Hive can also import data from HDFS. In this experiment, the number of records in users table is 1,000,000. After importing, count the record number:
As shown in the picture above, Hive call the MapReduce to query data, and it takes 14.6 seconds to count the record number of users table. Afterwards, select out the top 10 records of the table:
As we see, it takes 0.1 second.
Same as installing driver for other data sources, installing HiveODBC driver also requires unixODBC installed in the SAP HANA Server side. HiveODBC requires unixODBC-2.3.1 or newer version. For more details about installing unixODBX, please see the reference [2].
unixODBC installed, Begin to install HiveODBC driver. As introduced in reference [2], we use the HiveODBC provided by Simba Technologies. The procedure of installing is like below:
[hive1]
Driver=/<DRIVER_INSTALL_DIR>/simba/hiveodbc/lib/64/libsimbahiveodbc64.so
Host=<IP>
Port=10000
10. Edit the file $HOME/.customer.sh to set some environment variable:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:
/< DRIVER_INSTALL_DIR>/simba/hiveodbc/lib/64/
export ODBCINI=$HOME/.odbc.ini
11. Use isql to check whether SAP HANA Server can connect to remote data source successfully:
isql –v hive1
12. If connect successfully, execute “HDB start” to start SAP HANA.
When installing and configuring HiveODBC finished, create the hive data source in SAP HANA Studio following the steps introduced in Reference [1]. Here, you need to choose the HIVEODBC as the adapter.
After hive data source created, you can view the tables in hive, as shown in picture below:
Add a new virtual table which maps to table users in Hive following the steps introduced in Reference [1]. Then count the record number of the virtual table:
As shown above, it takes 14.1 seconds to count the virtual table in SAP HANA Studio, which is close to the time it cost in Hive side. The result tells us that SAP HANA SDA doesn’t influence the performance of operation in remote data source when little data transmission involved.
In this blog, we illustrate how SAP HANA SDA access the Hive table stored in Hadoop using a simple example. Hive is a tool which provide SQL interface for Hadoop. From the experiment result, querying the virtual table in SAP HANA Studio and querying the Hive table in Hive side is very close in performance when little data transmission involved.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
38 | |
19 | |
13 | |
13 | |
10 | |
10 | |
10 | |
8 | |
8 | |
8 |