
“SAP HANA smart data access enables remote data to be accessed as if they are local tables in SAP HANA, without copying the data into SAP HANA. Not only does this capability provide operational and cost benefits, but most importantly it supports the development and deployment of the next generation of analytical applications which require the ability to access, synthesize and integrate data from multiple systems in real-time regardless of where the data is located or what systems are generating it.”

Reference:  http://help.sap.com/hana/Whats_New_SAP_HANA_Platform_Release_Notes_en.pdf    Section 2.4.2


Databases currently supported by SAP HANA smart data access include:

  1. Teradata Database: version 13.0
  2. SAP Sybase IQ: versions 15.4 ESD#3 and 16.0
  3. SAP Sybase Adaptive Server Enterprise: version 15.7 ESD#4
  4. Intel Distribution for Apache Hadoop: version 2.3 (this includes Apache Hadoop version 1.0.3 and Apache Hive 0.9.0)

Also Refer to:

SAP Note 1868209: Additional information about SPS06 and smart data access

SAP Note 1868702: Information about installing the drivers that SAP HANA smart data access supports

UPDATE (Dec 04 2013): As of SPS07, Hortonworks HDP 1.3 (When’s HDP 2.0 coming?) appears to have been added to the official list, and remote caching of HADOOP sources has been added, which should hopefully speed up queries for those tables in HADOOP that aren’t changing frequently.

http://www.saphana.com/servlet/JiveServlet/previewBody/4296-102-1-9005/HANA_SPS07_NEW_SDA.pdf

UPDATE (Jan 29 2014):  SAP HANA Academy now has a great collection of videos using Smart Data Access.  Thanks SAP 🙂

e.g.

SAP HANA Academy | SAP HANA

SDA: HADOOP — Configuring ODBC drivers | SAP HANA

SDA: HADOOP — using the remote HADOOP data source | SAP HANA

Using Smart Data Access (SDA) with HADOOP seems to me a great idea for balancing the strengths of both tools. Unfortunately, for real-time responsiveness HIVE SQL currently isn’t the most optimal tool in HADOOP [instead it’s better used for batched SQL commands]. Cloudera’s Impala, Hortonworks’ Stinger initiative and MapR’s Drill are all trying to address real-time reporting.

I’ve only tested Impala so far, but I’ve noticed speeds of 10 to 100 times improvement over standard HIVE SQL queries. With that in mind I thought it would be interesting to test them both in HANA using SDA.

Unfortunately I’m using Cloudera’s open-source Apache Hadoop distribution (CDH), which isn’t on SAP’s approved list yet. However, since SDA uses ODBC, I’ve managed to get it working using a third-party ODBC driver from Progress|DataDirect. http://www.datadirect.com/products/odbc/index.html

NOTE: Since CDH is not currently on this list, I’m sure SAP will NOT recommend using this in a production environment. If you do get it working in a sandbox environment, though, why not help by adding your voice for it to be certified and added to the ‘official’ list.

With the disclaimers out of the way, this is how SDA works.

Remote Data Sources: 

Once you have your ODBC drivers installed properly, Remote Sources can be added for both HIVE and IMPALA.

Expanding the Remote Sources shows the tables that can be accessed by HANA.


NOTE: For me, expanding a node in the HIVE1 tree takes almost 20 seconds each time (perhaps it uses MapReduce?), whereas IMPALA1 nodes in the hierarchy expanded quickly.

In the above screen shots you will notice that both HIVE1 & IMPALA1 share the same tables, as they use the same HADOOP metastore. Data is NOT replicated between HIVE tables and IMPALA tables. The metastore just points to the tables’ file locations within the HADOOP ecosystem, whether stored as text files, HBASE tables or column-store PARQUET files (to list just a few).

There are some table types (file types) that can only be read by HIVE or IMPALA, but there is a large overlap, and this may converge over time.

Virtual Tables:

Select ‘Create virtual table’ from your Remote Source, in the schema of your choice.


NOTE: I’ve previously created a ‘HADOOP’ schema in HANA to store these virtual tables.

Once created you can open the definition of the new virtual tables, as per normal HANA tables.
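Virtual tables can also be created with SQL instead of the Studio context menu. The following is a sketch only; the schema HADOOP, remote source HIVE1 and the table name activity_log are from my setup, and the four-part remote identifier (source, database, owner, object) should be copied from what Studio shows for your source:

```sql
-- Create a virtual table in the local HADOOP schema that points at the
-- remote HIVE table "default"."activity_log" exposed by remote source HIVE1.
-- Parts of the four-part name that don't apply to a source appear as <NULL>.
CREATE VIRTUAL TABLE "HADOOP"."VT_ACTIVITY_LOG"
    AT "HIVE1"."<NULL>"."default"."activity_log";

-- Once created, it can be queried like any local HANA table:
SELECT COUNT(*) FROM "HADOOP"."VT_ACTIVITY_LOG";
```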

Run some queries:

Simple HIVE query on my extremely small and low powered HADOOP cluster (23 Seconds)

NOTE: In the HADOOP system, you can see above that HIVE’s MapReduce is kicked off

Simple IMPALA query on my extremely small and low powered HADOOP cluster (reading the SAME table as HIVE) (< 1 Second)

NOTE: Impala does not use MAP/REDUCE
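The comparison above can be reproduced by exposing the same remote table through both remote sources and querying each virtual table. A sketch, with illustrative virtual table names (assuming one virtual table was created from HIVE1 and one from IMPALA1):

```sql
-- Same underlying HADOOP table, reached via two different remote sources.
SELECT COUNT(*) FROM "HADOOP"."VT_ACTIVITY_HIVE";   -- routed through HIVE1 (launches a MapReduce job)
SELECT COUNT(*) FROM "HADOOP"."VT_ACTIVITY_IMPALA"; -- routed through IMPALA1 (no MapReduce)
```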

With Impala the source table type may impact speeds as well, as these two simple examples demonstrate.

IMPALA  HBASE table  (40K records in 4 seconds) :

IMPALA  PARQUET Column Store (60 Million Records in 3 Seconds)

HADOOP HBASE source tables are better for small writes and updates, but are slower at reporting.

HADOOP IMPALA PARQUET tables use column-store logic (similar to HANA column tables), which takes more effort to write to efficiently, but is much faster at reads (assuming not all the fields in a row are returned; not that dissimilar to HANA column tables).

You can think of Parquet tables like the part of a HANA column table after MERGE DELTA, whereas an HBASE table is more like the uncompressed part of a HANA column table PRIOR to MERGE DELTA.
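The HANA side of that analogy can be sketched in SQL (schema and table name illustrative): inserts into a column table land in the uncompressed delta store first, and MERGE DELTA moves them into the compressed, read-optimized main store.

```sql
CREATE COLUMN TABLE "MYSCHEMA"."SALES" (id INTEGER, amount DECIMAL(15,2));

-- New rows go to the delta store (roughly the HBASE-like, write-friendly part):
INSERT INTO "MYSCHEMA"."SALES" VALUES (1, 99.90);

-- The delta merge compresses them into the main store (roughly the
-- Parquet-like, read-optimized part); HANA normally triggers this itself:
MERGE DELTA OF "MYSCHEMA"."SALES";
```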

HADOOP tables are still stored on disk (using HDFS) rather than in memory; however, progress is being made on caching tables in memory on the nodes to improve query performance.

SQL for creating HADOOP Remote Source:

Unfortunately HADOOP remote sources can’t be manually configured yet. They do not appear in the drop-down:

Since the HADOOP adapter doesn’t appear in the list, use the HANA SQL editor to create the HADOOP Remote Sources:

e.g.

DROP REMOTE SOURCE HIVE1 CASCADE;

DROP REMOTE SOURCE IMPALA1 CASCADE;

CREATE REMOTE SOURCE HIVE1 ADAPTER "hiveodbc" CONFIGURATION 'DSN=hwp'

    WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=hive';

CREATE REMOTE SOURCE IMPALA1 ADAPTER "hiveodbc" CONFIGURATION 'DSN=iwp'

    WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=hive';

CDH Driver Installation:

Unfortunately Cloudera doesn’t yet provide ODBC drivers for SAP HANA.

I tried some of their other ODBC drivers (for MicroStrategy) without success.

Fortunately a third party, Progress|DataDirect, supplies ODBC drivers for HIVE and IMPALA running on CDH.

http://www.datadirect.com/products/odbc/index.html

Download their 15-day trial and follow their steps for installing it for HANA on Linux:

e.g.

wget http://www.datadirect.com/download/files/evals/connect64_odbc/712/PROGRESS_DATADIRECT_CONNECT64_ODBC_7.1.2_LINUX_64.tar.Z
gunzip PROGRESS_DATADIRECT_CONNECT64_ODBC_7.1.2_LINUX_64.tar.Z
tar -xf PROGRESS_DATADIRECT_CONNECT64_ODBC_7.1.2_LINUX_64.tar
./unixmi.ksh

In the $HOME directory of your ‘hdbadm’ user you need to add ODBC settings.

Create 2 files:

  .customer.sh   which adds the location of your new driver to the library path

  .odbc.ini      which defines the ODBC DSN connections needed when creating a Remote Source

My 2 files appear as follows:

.customer.sh

-----------------

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib:/opt/Progress/DataDirect/Connect64_for_ODBC_71/lib

export ODBCINI=$HOME/.odbc.ini

.odbc.ini

-----------

[ODBC Data Sources]

iwp=DataDirect 7.1 Impala Wire Protocol

hwp=DataDirect 7.1 Apache Hive Wire Protocol

[ODBC]

IANAAppCodePage=4

InstallDir=/opt/Progress/DataDirect/Connect64_for_ODBC_71

Trace=0

TraceFile=/tmp/odbctrace.out

TraceDll=/opt/Progress/DataDirect/Connect64_for_ODBC_71/lib/ddtrc27.so

[iwp]

Driver=/opt/Progress/DataDirect/Connect64_for_ODBC_71/lib/ddimpala27.so

Description=DataDirect 7.1 Impala Wire Protocol

ArraySize=1024

Database=default

DefaultLongDataBuffLen=1024

DefaultOrderByLimit=-1

EnableDescribeParam=0

HostName=[Put the IP address of your main IMPALA node here]

LoginTimeout=30

LogonID=

Password=

PortNumber=21050

RemoveColumnQualifiers=0

StringDescribeType=-9

TransactionMode=0

UseCurrentSchema=0

[hwp]

Driver=/opt/Progress/DataDirect/Connect64_for_ODBC_71/lib/ddhive27.so

Description=DataDirect 7.1 Apache Hive Wire Protocol

ArraySize=16384

Database=default

DefaultLongDataBuffLen=1024

EnableDescribeParam=0

HostName=[Put the IP address of your HIVE gateway here]

LoginTimeout=30

LogonID=hive

MaxVarcharSize=2147483647

Password=

PortNumber=10000

RemoveColumnQualifiers=0

StringDescribeType=12

TransactionMode=0

UseCurrentSchema=0

WireProtocolVersion=0



UPDATE (Dec 04 2013):  Here is an example of the new SPS07 remote caching syntax:

Select * from hive_activity_log where incident_type = 'ERROR' and plant = '001' with hint (USE_REMOTE_CACHE)

47 Comments

  1. Bill Ramos

    Hi Aron,

       Great write up! I’ll have to try my hand at getting this running with AWS EMR and Hive real soon, and it looks like your post will save me a bunch of time. I’m going to try out a new HIVE ODBC driver from Simba Technologies http://www.simba.com/connectors/apache-hadoop-hive.

        BTW, thanks for setting up the Google doc with your Impala findings with links to setting up the environment at https://docs.google.com/file/d/0Bxydpie8Km_fWTd3RmJTbjVHd00/edit?usp=sharing

    Regards,

    Bill

    1. Aron MacDonald Post author

      Thanks Bill. 

      I do hope you also manage to get it working. Spinning out a temporary HADOOP cluster using AWS EMR should produce much better run times than mine shown above, depending on number and size of nodes used.

      It should provide some interesting results. Please feel free to share your results here or on a new blog.

      As I’m sure you are aware AWS EMR can currently spin out two different distributions of HADOOP (Amazon distribution or MAPR distribution).   If you have HIVE ODBC driver issues with one you could always try the other.

      http://aws.amazon.com/elasticmapreduce/mapr/

      If you’ve not seen it yet, the HANA Academy has a video showing the setup of SDA for Sybase IQ 16. It’s also a useful point of reference.

      http://www.youtube.com/watch?v=SVTt0T3yfO8

      All the best and good luck.

      Aron

      1. Bill Ramos

        After a bunch of trial and error, I finally got the Simba ODBC driver working with AWS EMR and Hive .11. I’ll need to document my adventures in a new blog post. It’s interesting to note that Hive .11 introduced v1 of Stinger with the ORC file format. I found it pretty much non-functional. I guess we’ll need to wait for Hadoop 2.0 and the full version of Stinger.

        Thanks Aron for blazing the trail.

        Regards,

        Bill

        1. Aron MacDonald Post author

          That’s great to hear you also have it working now. I definitely am looking forward to reading your adventures.  🙂

          Cloudera CDH 4.4.0 is now running HADOOP 2.0, but they are still running HIVE 0.10 (which excludes the Hortonworks Stinger Phase 1 changes included in HIVE 0.11).

          I think I read somewhere that in Q4 or Q1 next year Cloudera will move to a later version of HIVE, probably 0.11, but it would be great if it’s HIVE 0.12 (Phase 2 Stinger). It’d be cool to benchmark Impala and an improved HIVE version in the same cluster, connected to HANA.

          Then also the HADOOP column-store file formats (Parquet/Impala & ORC/HIVE) can be compared. This rivalry is great for fast innovation, but who will win? The good news is that whoever comes out on top, it isn’t that hard to copy the data between the formats. Whew!!

  2. Monissha M.T Agil

    Hello Aron,

    Thank you so much for sharing.

    I have installed Apache Hadoop (1.1.2) and Hive (0.10.0) in my VM, downloaded the Apache Hive ODBC driver from DataDirect, and installed the driver on my HANA machine. In HANA studio, after driver installation, I am getting the HADOOP (ODBC) option in the dropdown list. But then I used the SQL statement (as you have specified above) to add the remote source. The remote source (HIVE) has been successfully added, but on expanding the added node (HIVE) I am getting the error “SAP DBTech JDBC[403]: internal error Cannot get the remote source objects: [unixODBC][Driver Manager] Can't open lib '/usr/sap/xse/home/driver/lib/ddhive27.so': file not found”

    But this file (ddhive27.so) actually exists in that path. I have added the IP address of the Hadoop node, and the username and password for the same node, in the .odbc.ini file.

    So could you please help me out in fixing this issue?

    Thank you.

    Best Regards,

    Monissha

    1. Aron MacDonald Post author

      Hi Monissha,

      Great to hear someone else is trying this. 🙂

      Firstly you might want to just check file permissions and restart HANA if required.

      You also might want to use the following command to check if any other associated shared library files are missing:

      ldd /usr/sap/xse/home/driver/lib/ddhive27.so

      If any associated files are missing you will need to source these and install.

      You might also want to test the connection from HANA to HIVE at the linux level using ‘isql’, if you have it installed.  It might give you a more detailed error.

      Let me know how you get on.

      Kind Regards

      Aron

      1. Monissha M.T Agil

        Hello Aron,

        Thank you for your suggestions:)

        I have changed the permission of the file and then restarted the HANA system.

        When I add the remote source now, I get the error “SAP DBTech JDBC[403]: internal error: Cannot get remote source objects: (some special characters)”.

        Also, running the isql command I get the error “Could not SQLConnect”, but it didn’t give any more detailed error than this.

        But I am using the username and password of the VM where I have installed Hadoop and Hive. Is that right?

        Thanks & Regards,

        Monissha

        1. Aron MacDonald Post author

          Hi Monissha,

          Ok good, we got past the first error at least.  🙂

          I presume you are using the Cloudera Distribution of Hadoop. Otherwise you might need an ODBC driver specific to your distribution.

          If you are definitely using the Cloudera distribution then go into ‘Cloudera Manager’, under:

          Services -> Hive

          Service-Wide -> Hive Metastore Database

          There you will find the HIVE userid and password I used in HANA when executing the ‘CREATE REMOTE SOURCE’ statement.

          In addition I used port 10000, which in my environment represents Hiveserver2, which has more advanced features than the older Hiveserver1.

          The following slide deck is old but gives some indication of the differences.

          http://www.slideshare.net/cwsteinbach/hiveserver2-for-apache-hive

          If your hiveserver2 is not currently running then you can launch it using the following command:

          hive --service hiveserver2

          I hope this gets you closer.

          Regards

          Aron

  3. Ganesh Mahadevan V

    Excellent Blog Aron. I am going to try this.. in my lab…

    In short what are the pre-requisites to try HIVE and SAP HANA ?

    Really appreciate your help.

    Thank you

    regards,

    ganesh

    1. Aron MacDonald Post author

      Thanks.  Please do give it a try  🙂

      My example above relies on having an HADOOP Cluster running, on separate machine(s).  HADOOP doesn’t run on a HANA box.

      In the following blog I provide a bunch of links of how to setup an HADOOP cluster, using the Cloudera distribution. 

      http://scn.sap.com/community/developer-center/hana/blog/2013/08/07/streaming-real-time-data-to-hadoop-and-hana

      But please take note that the Cloudera Distribution isn’t yet covered by the SPS06 release notes (see above).

      Unfortunately I don’t have details on how to setup the Intel distribution of HADOOP.

      Cheers

      Aron

  4. souri ratnaparkhi

    Hi Aron,

    Very nice blog.

    Can you please help me with this-:

    I have successfully made an ODBC connection between Hive and SAP HANA using the Simba driver. I followed the steps as per the SAP HANA approved documents.

    Then I created a table on the Hive server and loaded data into it from a .csv file. Then I fired the query “select * from table_name” on the SAP HANA server, but not all the data was retrieved. Only the first entry gets displayed. This is the output I get:

    select * from emp5
    +-------+------+
    | empid | name |
    +-------+------+
    | 1     |      |
    |       |      |
    (remaining rows returned blank)
    +-------+------+
    SQLRowCount returns -1
    10 rows fetched

    The output when I fire the same command on the Hive server:

    1   ABC
    2   ABCD
    3   ABCDE
    4   EFG
    5   HIJ
    6   KLM
    7   PQR
    8 STU
    9   VW
    10  XYZ

    Can you please let me know where I am going wrong? Is the issue on my side or with SAP HANA?

    1. Aron MacDonald Post author

      Hi,

      Unfortunately  I’ve not used the SIMBA driver. 

      I presume you were getting the blank results using ‘isql’ from your HANA box’s linux prompt.

      Did you try anyway to add the remote data source & virtual table from HANA studio?

      Does the HANA SQL editor also return blank rows when you perform the select statement? 

      If you are really lucky perhaps it’s just ‘isql’ that is having a problem interpreting the ODBC connection, perhaps due to the unicode type of the returned dataset???

      Alternatively you could try an alternative ODBC driver.

      Sorry I can’t be more help.

      Regards

      Aron


  5. souri ratnaparkhi

    Hi Aron,

    Now I am trying the ODBC connection with your method. I downloaded the DataDirect ODBC drivers from the location you specified. I kept the drivers in /opt/hive/ on the SAP HANA server. I set up the Cloudera distribution on a single machine. As I only want a connection between Hive and HANA, my .odbc.ini file looks like this:

      [hive1]

    Driver=/opt/hive/lib/ddhive27.so

    Description=DataDirect 7.1 Apache Hive Wire Protocol

    ArraySize=16384

    Database=default

    DefaultLongDataBuffLen=1024

    EnableDescribeParam=0

    HostName=master

    LoginTimeout=30

    LogonID=hive

    MaxVarcharSize=2147483647

    Password=hive

    PortNumber=10000

    RemoveColumnQualifiers=0

    StringDescribeType=12

    TransactionMode=0

    UseCurrentSchema=0

    WireProtocolVersion=0

    my .profile file entries are

    export ODBCINI=$HOME/.odbc.ini
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib:/opt/hive/lib

    but when I run the SQL command for the DSN connection in HANA studio I get the following error:

    “SAP DBTech JDBC[403]: internal error Cannot get the remote source objects: [unixODBC][Driver Manager] Can't open lib '/opt/hive/lib/ddhive27.so': file not found”

    And when I fired the command “isql hive1” in the SAP HANA terminal I got the following error:

    Inconsistency detected by ld.so: dl-open.c: 583: _dl_open: Assertion `_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT' failed!

    Can you please help me?

    1. Aron MacDonald Post author

      Please check that your Driver path matches the exact location of the HIVE ODBC driver you have downloaded.

      If you look at my example above I have specified:

      Driver=/opt/Progress/DataDirect/Connect64_for_ODBC_71/lib/ddhive27.so

      I hope that helps.

  6. souri ratnaparkhi

    Hi Aron,

    I successfully solved the previous error, but now I am facing a new issue.

    When I do isql -v hive1 I get the following error:

    [08001][unixODBC][DataDirect][ODBC Apache Hive Wire Protocol driver]Client unable to establish connection.Error validating the login

    As mentioned by you in one of the comments above, I inserted the username and password from ‘Cloudera Manager’, under:

    Services -> Hive

    Service-Wide -> Hive Metastore Database

    Does hiveserver2 have some different login credentials? Please let me know.

    thank you

    1. Aron MacDonald Post author

      Hi Souri,

      Did you try:

      isql -v hive1 <user from Hive config> <password from Hive config>

      I only tried to get it to work with hiveserver2, so are you able to confirm it is running?

      I’m not sure it should make a difference but I also setup my Hadoop cluster to use ‘mysql’ instead of default ‘derby’ for the Hive metastore.

      This link has steps for setting this up:

      http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html

      Hope that helps.

      Aron



      1. Teja sistu

        Hi Aron,

        Wow very good information and thanks for the post. Will try now in our lab…..

        Can I get these original version drivers in SAP service market place? if please let me know the path….

        Teja

        1. Aron MacDonald Post author

          I’m not aware of any HADOOP Hive drivers being on the SAP service market place.

          Please check out the OSS notes and links above, they may help.

  7. John Barton

    Hi Aron,

    I am on-site doing a POC with HANA rev 68, successfully using the Cloudera-provided ODBC driver, which Cloudera OEMs from Simba. I will need to come back and add some notes; we had a bit of trouble getting it working, so I will post the ODBC DSN entry details, which should help others.

    Great documentation. I can also provide info on setting up the same for the Intel Hadoop Distribution, which I have assisted in setting up for another POC. In that case we used an ODBC driver provided by Intel, which is similar to the Cloudera OEM version but slightly different.

    Glad to see others are doing similar things. Also, if anyone has any issues with setting up a connection to IQ, I have experience doing that as well.

    In addition, I figured out how to use a create table statement to adjust the width of the string data types in the virtual table definition, and by doing so reduced the memory needed to bring in more rows. I need to post that as well.

    John

    1. Aron MacDonald Post author

      Thanks John.  I look forward to hearing the ODBC settings you’ve tried with the SIMBA driver. Did you use the HIVE or IMPALA Cloudera ODBC driver?

      I’ve also recently tried Cloudera’s new IMPALA ODBC driver, but haven’t got it quite working yet, so haven’t updated this blog with those details.  I’m on HANA Rev 60.  I can setup the connection in HANA and create a virtual table, but so far only ‘select count(*)’ works from the SQL editor, as soon as I try a proper query I get a JDBC error in HANA studio. Since you mentioned something about string data type I wouldn’t be surprised if it’s related.

      1. John Barton

        Here are the instructions for configuring SDA to access Cloudera and Hive Server 2 against HANA 68

        Cloudera ODBC Installation:

        This is the OEM version of the Simba driver from Cloudera:

        http://www.cloudera.com/content/support/en/downloads/download-components/download-products/downloads-listing/connectors/cloudera-odbc-drivers.html

        Make sure you download the Hive ODBC driver. I have not tried the Impala driver, which is separate; I have only tried Hive Server 2.

        After installation of unixODBC (the ODBC driver manager) and the ODBC driver, the driver installs here by default:

        ls -l /opt/cloudera/hiveodbc
        ls -l /opt/cloudera/hiveodbc/lib/64
        ls -l /opt/cloudera/hiveodbc/Setup/

        You will also find the pdf installation guide in the installation directory

        rnop-tmphan1:~> ls -l /opt/cloudera/hiveodbc

        -r--r--r-- 1 root root  11995 Jul 24 18:23 Cloudera-EULA.txt
        -r--r--r-- 1 root root 884481 Jul 25 09:39 Cloudera ODBC Driver for Apache Hive Install Guide.pdf
        drwx---r-x 3 root root   4096 Nov  1 13:18 ErrorMessages
        drwx---r-x 3 root root   4096 Nov  1 13:18 lib
        -r--r--r-- 1 root root   3476 Jul 24 18:23 Readme.txt
        -r--r--r-- 1 root root    214 Jul 24 18:23 Release Notes.txt
        drwx---r-x 2 root root   4096 Nov  1 13:18 Setup

        rnop-tmphan1:~> ls -l /opt/cloudera/hiveodbc/lib/64
        total 45236
        -rwxr-xr-x 1 root root 43909314 Jul 24 18:23 libclouderahiveodbc64.so
        lrwxrwxrwx 1 root root       18 Nov  1 13:19 libcrypto.so -> libcrypto.so.1.0.0
        -rwxr-xr-x 1 root root  1943353 Jul 24 18:23 libcrypto.so.1.0.0
        lrwxrwxrwx 1 root root       15 Nov  1 13:19 libssl.so -> libssl.so.1.0.0
        -rwxr-xr-x 1 root root   400543 Jul 24 18:23 libssl.so.1.0.0
        -r--r--r-- 1 root root      592 Jul 24 18:23 SimbaHiveODBC.did

        rnop-tmphan1:~> ls -l /opt/cloudera/hiveodbc/Setup/
        total 12
        -rwxr-xr-x 1 root root 1146 Jul 24 18:23 cloudera.hiveodbc.ini
        -rwxr-xr-x 1 root root 2013 Jul 24 18:23 odbc.ini
        -rwxr-xr-x 1 root root  641 Jul 24 18:23 odbcinst.ini

        rnop-tmphan1:~> zypper se odbc

        Loading repository data…
        Reading installed packages…

        S | Name             | Summary                                        | Type
        --+------------------+------------------------------------------------+--------
        i | ClouderaHiveODBC | Cloudera ODBC Driver for Apache Hive           | package
        i | unixODBC         | ODBC driver manager with some drivers included | package

        rnop-tmphan1:~> zypper info unixODBC
        Loading repository data…
        Reading installed packages…

        Information for package unixODBC:

        Repository: @System
        Name: unixODBC
        Version: 2.2.12-198.17
        Arch: x86_64
        Vendor: SUSE LINUX Products GmbH, Nuernberg, Germany
        Support Level: unknown
        Installed: Yes
        Status: up-to-date
        Installed Size: 1.6 MiB
        Summary: ODBC driver manager with some drivers included
        Description:
        UnixODBC aims to provide a complete ODBC solution for the Linux
        platform. Further drivers can be found at http://www.unixodbc.org/.

        rnop-tmphan1:~> zypper info ClouderaHiveODBC
        Loading repository data…
        Reading installed packages…

        Information for package ClouderaHiveODBC:

        Repository: @System
        Name: ClouderaHiveODBC
        Version: 2.5.0.1001-1
        Arch: x86_64
        Vendor: Cloudera Inc.
        Support Level: unknown
        Installed: Yes
        Status: up-to-date
        Installed Size: 45.1 MiB
        Summary: Cloudera ODBC Driver for Apache Hive
        Description:
        An ODBC driver providing connectivity to an Apache Hive data source.

        rnop-tmphan1:~> ls -la ~/
        total 140
        drwxr-x--- 13 hn1adm sapsys 4096 Nov  6 10:43 .
        drwxr-xr-x  5 hn1adm sapsys 4096 Oct 23 12:12 ..
        -rwxr-x---  1 hn1adm sapsys 1147 Nov  1 13:28 .cloudera.hiveodbc.ini
        -rwxr-x---  1 hn1adm sapsys  997 Nov  6 10:43 .odbc.ini

        These two files listed above get copied out of the ls -l /opt/cloudera/hiveodbc/Setup/ directory and get renamed to <dot> files.

        I think most of the instructions are not that great; I would rather set the environment variables in .profile. These files and modifications are for the account that starts the HANA server.

        su - hn1adm

        rnop-tmphan1:~> cat .profile

        #  *********************************************************************
        # ***********************************************************************
        # ****                                                               ****
        # ***           $HOME/.profile                                        ***
        # ***           init script bash                                      ***
        # ***                                                                 ***
        # ***                                                                 ***
        # ***           To allow correct administration for SAP R/3, the      ***
        # ***           following section in this file is required and may    ***
        # ***           not be changed:                                       ***
        # ***                                                                 ***
        # ***             # SAP environment                                   ***
        # ***                                                                 ***
        # ***           To facilitate Hotline Support nothing else should be  ***
        # ***           changed. If necessary site-specific modifications     ***
        # ***           can be included in the file .customer.sh              ***
        # ***           which will be sourced by this script via .bashrc      ***
        # ***                                                                 ***
        # ***           Copyright (c) 2012 SAP-AG                             ***
        # ***                                                                 ***
        # ****                                                               ****
        # ***********************************************************************
        #  *********************************************************************

        # SAP environment
        if [ -f $HOME/.bashrc ]; then
             . $HOME/.bashrc
        fi # SAP environment

        #
        # Added by SAP JB on 10-28-2013
        #
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/sap/HN1/home/simba/hiveodbc/lib/64/
        export ODBCINI=/usr/sap/HN1/home/.odbc.ini

        Logout and login and check the environment variables:

        rnop-tmphan1:~> echo $LD_LIBRARY_PATH
        /usr/sap/HN1/HDB00/exe:/usr/sap/HN1/HDB00/exe/Python/lib:/usr/sap/HN1/HDB00/exe/filter:/usr/sap/HN1/HDB00/exe/dat_bin_dir:/usr/sap/HN1exe/plugins/afl:/usr/sap/HN1/HDB00/exe/plugins/lcapps:/usr/sap/HN1/SYS/global/security/lib:/usr/sap/HN1/SYS/global/hdb/federation:/usr1/HDB00/exe:/usr/sap/HN1/HDB00/exe/Python/lib:/usr/sap/HN1/HDB00/exe/filter:/usr/sap/HN1/HDB00/exe/dat_bin_dir:/usr/sap/HN1/HDB00/exe//afl:/usr/sap/HN1/HDB00/exe/plugins/lcapps:/usr/sap/HN1/SYS/global/security/lib:/usr/sap/HN1/SYS/global/hdb/federation:/usr/sap/HN1/hoa/hiveodbc/lib/64/

        rnop-tmphan1:~> echo $ODBCINI
        /usr/sap/HN1/home/.odbc.ini

        Create some DSNs; the first one is for Hive and the second one is for HANA SDA to HANA.

        rnop-tmphan1:~> cat .odbc.ini

        [ODBC]
        # Specify any global ODBC configuration here such as ODBC tracing.

        [ODBC Data Sources]
        Sample Cloudera Hive DSN 64=Cloudera ODBC Driver for Apache Hive 64-bit

        [HIVE1]

        # Description: DSN Description.
        # This key is not necessary and is only to give a description of the data source.
        Description=Cloudera ODBC Driver for Apache Hive (64-bit) DSN

        # Driver: The location where the ODBC driver is installed to.
        Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so

        # The DriverUnicodeEncoding setting is only used for SimbaDM
        # When set to 1, SimbaDM runs in UTF-16 mode.
        # When set to 2, SimbaDM runs in UTF-8 mode.
        DriverUnicodeEncoding=1

        # Values for HOST, PORT, HS2HostFQDN, and HS2KrbServiceName should be set here.
        # They can also be specified on the connection string.
        HOST=10.210.43.123
        PORT=10000
        HiveServerType=2
        HS2AuthMech=2
        UserName=hdfs

        [HANAHN1]
        ServerNode=10.210.46.144:30015
        Driver=libodbcHDB.so
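        Before involving HANA at all, it can save time to sanity-check the .odbc.ini itself. The sketch below parses an inline copy of the example file above and flags missing keys; in practice you would read the file pointed to by $ODBCINI instead of the embedded string:

```python
# Sanity-check DSN sections in an .odbc.ini before testing with isql/HANA.
# The inline ini_text mirrors the example file above; in real use, read the
# file named by the ODBCINI environment variable instead.
import configparser

ini_text = """
[ODBC Data Sources]
Sample Cloudera Hive DSN 64=Cloudera ODBC Driver for Apache Hive 64-bit

[HIVE1]
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
HOST=10.210.43.123
PORT=10000
HiveServerType=2
HS2AuthMech=2
UserName=hdfs

[HANAHN1]
ServerNode=10.210.46.144:30015
Driver=libodbcHDB.so
"""

def check_dsn(cfg, dsn, required):
    """Return the list of required keys missing from a DSN section."""
    return [k for k in required if not cfg.has_option(dsn, k)]

cfg = configparser.ConfigParser()
cfg.read_string(ini_text)

for dsn, required in [("HIVE1", ["Driver", "HOST", "PORT"]),
                      ("HANAHN1", ["Driver", "ServerNode"])]:
    missing = check_dsn(cfg, dsn, required)
    print(dsn, "OK" if not missing else f"missing: {missing}")
```

        A check like this catches the typo-in-a-key-name class of problem before you start chasing driver and authentication errors.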

        unixODBC requires making changes to the .cloudera.hiveodbc.ini file:

        rnop-tmphan1:~> cat .cloudera.hiveodbc.ini
        [Driver]

        ## - Note that the default DriverManagerEncoding of UTF-32 is for iODBC.
        ## - unixODBC uses UTF-16 by default.
        ## - If unixODBC was compiled with -DSQL_WCHART_CONVERT, then UTF-32 is the correct value.
        ##   Execute 'odbc_config --cflags' to determine if you need UTF-32 or UTF-16 on unixODBC.
        ## - SimbaDM can be used with UTF-8 or UTF-16.
        ##   The DriverUnicodeEncoding setting will cause SimbaDM to run in UTF-8 when set to 2 or UTF-16 when set to 1.

        DriverManagerEncoding=UTF-16
        ErrorMessagesPath=/opt/cloudera/hiveodbc/ErrorMessages/
        LogLevel=0
        LogPath=

        ## - Uncomment the ODBCInstLib corresponding to the Driver Manager being used.
        ## - Note that the path to your ODBC Driver Manager must be specified in LD_LIBRARY_PATH (LIBPATH for AIX).
        ## - Note that AIX has a different format for specifying its shared libraries.

        # Generic ODBCInstLib
        #   iODBC
        # ODBCInstLib=libiodbcinst.so

        #   SimbaDM / unixODBC
        ODBCInstLib=libodbcinst.so

        # AIX specific ODBCInstLib
        #   iODBC
        #ODBCInstLib=libiodbcinst.a(libiodbcinst.so.2)

        #   SimbaDM
        #ODBCInstLib=libodbcinst.a(odbcinst.so)

        #   unixODBC
        #ODBCInstLib=libodbcinst.a(libodbcinst.so.1)

        At this point you should be able to connect to Hive using isql, which is part of the unixODBC package (not to be confused with the SAP Sybase Open Client isql utility for connecting to ASE and IQ):

        rnop-tmphan1:~> isql -v HIVE1
        +---------------------------------------+
        | Connected!                            |
        |                                       |
        | sql-statement                         |
        | help [tablename]                      |
        | quit                                  |
        |                                       |
        +---------------------------------------+
        SQL> show tables
        +----------+
        | tab_name |
        +----------+
        | dual     |
        |          |
        +----------+
        SQLRowCount returns -1
        2 rows fetched

        Not sure why myhivetable is not showing up above.

        SQL> select count(*) from myhivetable;
        +---------------------+
        | EXPR_1              |
        +---------------------+
        | 226112509           |
        +---------------------+
        SQLRowCount returns -1
        1 rows fetched
        SQL>

        At this point you need to restart the HANA server so that it picks up the new environment, which will allow it to find the ODBC driver via LD_LIBRARY_PATH.

        Note: if you are doing SDA to SAP IQ, you might want to add the IQ driver path last in LD_LIBRARY_PATH. If installed using the IQ Client kit, the IQ ODBC driver can itself act as an ODBC driver manager, so you want to make sure the real driver manager gets loaded before the IQ driver.
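        The ordering issue can be illustrated with a toy model of the loader's left-to-right search. The directory names below are purely hypothetical; the point is only that the first directory on LD_LIBRARY_PATH providing a given library name wins:

```python
# Toy model of the dynamic loader's LD_LIBRARY_PATH search (hypothetical paths):
# directories are searched left to right, and the first one containing the
# requested library "wins". If the IQ client directory (whose driver can also
# act as a driver manager) comes first, its copy shadows the real one.

def first_match(search_path, lib, contents):
    """Return the first directory on the colon-separated path that
    provides the library, mimicking the loader's left-to-right search."""
    for d in search_path.split(":"):
        if lib in contents.get(d, set()):
            return d
    return None

# Hypothetical layout: both unixODBC and the IQ client ship a libodbc.so.
contents = {
    "/usr/lib64/unixODBC": {"libodbc.so", "libodbcinst.so"},
    "/opt/sybase/IQ-16_0/lib64": {"libodbc.so", "libdbodbc16.so"},
}

good = "/usr/lib64/unixODBC:/opt/sybase/IQ-16_0/lib64"
bad = "/opt/sybase/IQ-16_0/lib64:/usr/lib64/unixODBC"

print(first_match(good, "libodbc.so", contents))  # the real driver manager wins
print(first_match(bad, "libodbc.so", contents))   # IQ's copy shadows it
```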

        Now that we have HANA restarted with the driver in the LD_LIBRARY_PATH and ODBCINI pointing at the .odbc.ini file, we can create the remote source entry for the Hive server.

        This was my first exposure to Hive Server 2, and the connectivity was not working until I got the DSN info correct. Open up a SQL window in HANA Studio and run the create remote source command:

        CREATE REMOTE SOURCE HIVE1 ADAPTER "hiveodbc" CONFIGURATION 'DSN=hive1'
        WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hdfs;password=hdfs';

        If you need to add a HANA server, here is the command:

        CREATE REMOTE SOURCE HANA1 ADAPTER "hanaodbc" CONFIGURATION 'DSN=HANA1'
        WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=SYSTEM;password=HP1nv3nt';

        At this point you can expand the Provisioning entry and you should see the remote source.

        The easiest way to create the virtual table is in HANA Studio: right click on the table under the remote source and select Create Virtual Table. The alternative is to use a DDL command, which has the advantage that you can size the Hadoop string data types to better match the data. Hadoop is not as strongly typed as relational databases, where char and varchar typically have a length specified, so sizing the columns yourself can help limit the memory required when selecting data back from Hive.

        create virtual table MYSCHEMA.MYVIRTUALTABLE (
        COLUMN1     NCHAR(12),
        COLUMN2     NCHAR(12),
        COLUMN3     INTEGER,
        COLUMN4     BIGINT,
        ....
        COLUMN_N    NCHAR(40)
        ) at "HIVE1"."default"."default"."my_hive_table";

        The default behavior of the Simba driver is to bring every string in with a length of 255. I believe I read that this default can be adjusted.
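        Some rough, back-of-envelope arithmetic (not an official HANA sizing formula, just worst-case buffer math) shows why sizing the string columns matters for a table with a row count like the one queried above:

```python
# Back-of-envelope comparison (NOT a HANA sizing formula): worst-case buffer
# space for string columns when fetching every row, with the Simba default of
# length 255 vs. columns sized to the actual data.
ROWS = 226_112_509          # row count from the Hive table queried above
BYTES_PER_CHAR = 2          # NCHAR/NVARCHAR columns hold Unicode characters

def worst_case_bytes(rows, declared_lengths, bytes_per_char=BYTES_PER_CHAR):
    """Worst-case string-buffer space for one full fetch of `rows` rows."""
    return rows * sum(declared_lengths) * bytes_per_char

# Two string columns, default 255 each vs. sized to the data (12 each):
default_gb = worst_case_bytes(ROWS, [255, 255]) / 1024**3
sized_gb = worst_case_bytes(ROWS, [12, 12]) / 1024**3
print(f"default 255s: ~{default_gb:,.0f} GB   sized: ~{sized_gb:,.0f} GB")
```

        The absolute numbers are only illustrative, but the ratio (255 vs. the real column width) is why the DDL variant with explicit lengths is worth the extra typing.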

        If you are having trouble getting the ODBC DSN working, it might be worth installing the Windows version of the Hive driver, configuring that, and making sure it connects. That is what I ended up doing, as I had trouble with the authentication defaulting differently under Hive Server 2.

        I followed these same instructions in working with the Intel Hadoop Distribution (IDH), which was running Hive Server 1 at the time I tested this.

        If you are not familiar with SAP Sybase IQ, I suggest setting that up as well. It is a fantastic columnar database, and it has built-in SDA to HANA in the 16 release. SDA comes from Sybase and was rewritten for HANA; it was originally called Sybase Component Integration Services and exists in ASE, IQ, SQL Anywhere and now HANA. I am looking forward to additional improvements coming in HANA SP7.

        1. Bill Ramos

          Hi John,

             Thanks for the write-up. I'm currently trying to connect to HDP and I've got everything almost working with HANA version .62. I was trying out your create virtual table example and I'm getting an error:

          Could not execute 'CREATE TABLE "WIKIDATA"."HIVE_pagecountfilter" ("projectcode" VARCHAR(50), "pagename" ...' in 2 ms 67 µs.

          SAP DBTech JDBC: [257] (at 228): sql syntax error: incorrect syntax near "HIVE1": line 10 col 6 (at pos 228)

          The create table statement looks like this:

          CREATE TABLE "WIKIDATA"."HIVE_pagecountfilter"
          ("projectcode" VARCHAR(50),
          "pagename" VARCHAR(2000),
          "hitsperhour" BIGINT,
          "bytesperhour" BIGINT,
          "year" VARCHAR(4),
          "month" VARCHAR(2),
          "day" VARCHAR(2),
          "hour" VARCHAR(2)
          ) AT "HIVE1"."default"."default"."pagecountfilter";

          Using the HANA Studio UI, the virtual table is created, just with undesirable data types.

          Here is what I see under the Provisioning folder:

          Provisioning > Remote Sources > HIVE1 > HIVE > default > pagecountfilter

          I created the remote source as follows:

          CREATE REMOTE SOURCE HIVE1 ADAPTER "hiveodbc" CONFIGURATION 'DSN=hive1'

              WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=hive';

          Any ideas, other than a version issue with HANA?

          Regards,

          Bill

          1. Aron MacDonald Post author

            Hi Bill,

            Perhaps a typo above but did you try it with the first line:

            CREATE VIRTUAL TABLE "WIKIDATA"."HIVE_pagecountfilter"

            Cheers

            Aron

            1. Bill Ramos

              Hi Aron,

                 Adding VIRTUAL – doh. It turns out that the other trick is to match the namespace shown in the Remote Sources. In my case, this was the create table statement that worked.

              CREATE VIRTUAL TABLE "WIKIDATA"."HIVE_pagecountfilter"
              ("projectcode" VARCHAR(50),
              "pagename" VARCHAR(2000),
              "hitsperhour" BIGINT,
              "bytesperhour" BIGINT,
              "year" VARCHAR(4),
              "month" VARCHAR(2),
              "day" VARCHAR(2),
              "hour" VARCHAR(2)
              ) AT "HIVE1"."HIVE"."default"."pagecountfilter";

              Regards,

              Bill

              1. Aron MacDonald Post author

                Great stuff. I couldn't quite get that far using the latest Cloudera Impala ODBC drivers; the create statement still gives me nasty JDBC errors. I'm also still on HANA Rev 60, so that may not help.

                I don't like how slow Hive 0.10 is with SDA, so I'm going to follow your lead and give the Hortonworks Hadoop distribution a shot. I was waiting for the Stinger Phase 3 enhancements to Hive before checking it out, but I would like to get this all working again using free ODBC drivers, rather than the 15-day trial drivers I used initially. I'm stingy like that. 🙂

                1. Bill Ramos

                  Hi Aron,

                     As I mentioned, HDP has a "free" ODBC driver with their release. I also have some written instructions on how to install the driver on a HANA One instance. As always, the vendors leave a few mystery steps.

                       I've been playing around with a 2.0 AMI from a contact at Hortonworks, and there is a very good chance I might be able to share it with folks. It comes with Hive 0.11 with "Stinger" and has three months of Wikipedia page-hit data preloaded in three different file formats for comparison.

                  Regards,

                  Bill

                    1. Ashish Tewari

                      Hi Aron, I am also working on a POC for Hadoop and HANA integration through SDA. My problem is that when I simply right click and create a virtual table on Hive, it brings the data types in as VARCHAR(255). So I used the code mentioned above to create the virtual table:

                      CREATE VIRTUAL TABLE "TEWARIAS"."ZHIST_SALES" ("customerid" NVARCHAR(4), "material" NVARCHAR(4), "quantity" DECIMAL(4,2), "salesdate" DATE, "salesorder" NVARCHAR(10) ) AT "HIVE1"."HIVE"."sales"."sales_tran";

                      I am able to create it, however when I try to see the data (Open Data Preview) I get this error:

                      Error: SAP DBTech JDBC: [403]: internal error: Error while fetching a row[Cloudera][Support] (40550) Invalid character value for cast specification. for query "SELECT sales_tran.salesorder, sales_tran.customerid, sales_tran.material, sales_tran.quantity, sales_tran.salesdate FROM HIVE.sales.sales_tran sales_tran "

                      Any suggestions? Thanks, Ashish

  8. John Barton

    I should also mention that on another POC I assisted in getting SDA working with the Intel Hadoop Distribution (IDH). In that case unixODBC 2.3 was installed on the HANA box, and that did not work with the Simba driver we were using. We backed it out and installed unixODBC 2.2, and that worked!

    1. Tim Forslund

      How did you “back out” the unixODBC drivers/version?  I have been unable to find any documentation about it.  I’ve installed unixODBC 2.2 & 2.2.3 and both appear to be on the system.  The instructions on unixodbc.com only mention “remove all the libraries”.

      I've installed unixODBC 2.3.2 and Simba drivers 1.4, and am attempting to connect to a Cloudera CDH5 distribution with Hive 0.12.0-cdh5.1.0. HANA is Rev 70. The following error is received:

      isql -v HIVE

      [08001][unixODBC][Simba][ODBC] (10380) Unable to establish a connection with data source. Missing settings: {[PWD] [UID]}

      [ISQL]ERROR: Could not SQLConnect

      Any assistance is much appreciated!

      Tim

  9. Rahul Kabra

    Thanks John, these were very helpful instructions. Thanks for sharing!

    Also, Thank you Aron for this great blog.

    I would just like to add one thing: by default my HiveServer was not running, hence port 10000 was not listening. As a newbie, I struggled with this for half a day. I had to manually go to the Hive service and add HiveServer2, after which everything started working.

    Thanks again!

  10. Augusto Arikawa

    Hi Aron,

    We are trying to use SDA in our HANA system, but we couldn't find two essential pieces of information for our backlog:

    1) If we have two identical tables, one on the external storage and one in our HANA system, is it possible to merge the data somehow without needing to virtualize the external table?

    2) How does the process of turning hot data into cold data work? Is there an automatic way to do it?

    In our scenario that is relevant because we will need to store data from two or more years ago and be able to access it on some specific occasions.

    1. Aron MacDonald Post author

      I think I've heard that the 'BW on HANA' hot and cold concept is moving to standalone HANA in SPS07, for Sybase IQ SDA only. I wouldn't be surprised if they open this up to other sources such as HADOOP in subsequent releases, but I'll just have to keep my fingers crossed.

      If you can’t wait then perhaps a Calculation view (using a Union) over your HANA table (presumably with current data), and Virtual Table (old data) would be sufficient.

      Depending on your data volumes, you might also need to use input parameters to perform some switching, so that unless 'OLD' data is being requested the calculation view only reads the HANA table.
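      As a toy illustration of that switching idea (table names and the archive cutoff date below are hypothetical), the decision of which union branch to read might look like:

```python
# Toy sketch of the input-parameter switching Aron describes: only read the
# virtual (Hadoop) branch of the union when the requested date range actually
# reaches back into the archive. All names and dates here are hypothetical.
from datetime import date

ARCHIVE_CUTOFF = date(2012, 1, 1)  # rows older than this live in Hadoop

def sources_to_read(from_date, to_date):
    """Decide which union branches a query needs for a given date range."""
    sources = []
    if to_date >= ARCHIVE_CUTOFF:
        sources.append("HANA_TABLE")          # current data, in-memory
    if from_date < ARCHIVE_CUTOFF:
        sources.append("VIRTUAL_TABLE_HIVE")  # old data, fetched via SDA
    return sources

print(sources_to_read(date(2013, 6, 1), date(2013, 7, 1)))  # HANA branch only
print(sources_to_read(date(2010, 1, 1), date(2013, 7, 1)))  # both branches
```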

      Your reports could then be built off the new calculation view, which would then work for old and new data, assuming you have tested that the virtual table will return the old data in adequate time.

        1. Aron MacDonald Post author

          No worries.  Thanks for the comment, but I think this post has really been made much more informative by all the  questions and great contributions by everyone.   Cheers to All. 😎

  11. Aron MacDonald Post author

    -----> Continued from the comments to Bill above.

    Hi Bill,

    If your contacts at Hortonworks can share details of the AWS AMI, that'd be great, thanks. It'd be interesting to share results. 🙂

    I much prefer using AWS to local virtual machines, as you can scale out prototypes easily by adding instances (nodes). As HADOOP continues to take off for massive data storage and retrieval, I can't see a future where organisations individually set up farms of servers to run HADOOP clusters. Instead, as cloud security is better understood and improved, organisations will probably increasingly use providers such as AWS.

    I wouldn't be surprised if in a few years the SAP Cloud Appliance Library even includes a mechanism for launching a Hadoop cluster and scaling it out, to help businesses cope with the increasing demands of large data storage and retrieval, integrated directly with HANA. Instant results + infinite storage truly working together.

    Ok head out of the clouds and back to earth.

    I've actually already set up an HDP 2.0 single-node cluster (to start) on AWS, and was just about to get to the ODBC/SDA connection to HANA, so if you can share your notes that'd be great.

    I'm also happy to provide my steps for setting up HDP 2.0 (including Hive 0.12) on AWS, but this probably isn't the forum.

    Having said that, I think I need to do a bit more reading up on the improvements made to Hive. I'm probably doing something wrong, because so far I've yet to see any speed improvement.

    After creating a simple table (Key Integer, Field String) with 2 rows (CSV format), I ran a simple comparison on ridiculously small 1-node clusters between CDH 4.4 and HDP 2.0 (both running on AWS m1.medium). Results as follows:

    Select count(*) from TestTable

    CDH 4.4 (Hive 0.10):     30 seconds
    HDP 2.0 (Hive 0.12):     40 seconds
    CDH 4.4 (Impala 1.1.1):  <1 second

    Obviously I need to do much more sophisticated tests than this. Since HADOOP is really meant for huge amounts of data, hopefully as I increase my data set (and perhaps adopt tables in ORC format) Hive 0.12 will start to shine. I'll test that out another day.

    The fun really starts when seeing the impact of data sizes on HANA virtual table performance, which you are probably exploring now. 😎

  12. Ganesh Mahadevan Vaidyanathan

    Hi Aron,

    We wanted to do a POC on Hadoop and HANA, just to build a small prototype and explore the capabilities. We do have a HANA instance (Cisco, Enterprise Edition, Rev 68).

    W.r.t. Hadoop, what are the prerequisites and software required (Hadoop, Hive, Impala, etc.)? Is there any free trial version / starter pack downloadable?

    Please help.

    Thanks in advance.

    ganesh

    1. Aron MacDonald Post author

      Hi Ganesh,

      I’m pleased to hear you are going to give it a try.

      Regarding Hadoop it really depends on the scope of your PoC.

      I've only tried the Cloudera and Hortonworks HADOOP distributions. There are pros and cons to both.

      You could start out very small, perhaps using a Hadoop virtual machine running on your laptop.

      e.g.

      Hortonworks Sandbox

      Cloudera Virtual Machine

      A better option though is to run HADOOP on dedicated machines, either on-premise or in a cloud.

      CDH

      HDP 2.0 – The complete Hadoop 2.0 distribution for the enterprise


      I've done my PoCs on EC2 because they are easy to scale depending on need (though not the cheapest option). At an absolute minimum you can run Hadoop on a single EC2 m1.medium (4 GB RAM, 100 GB storage). Adding additional higher-spec nodes is of course better; it's all down to budget and scope. With the minimum setup you certainly will not see the true power and potential of HADOOP.

      The smallest production-ready cluster might be a 4-node cluster (each node with 48 GB memory).

      I hope that helps.

      Cheers

      Aron

  13. Leo Hu

    Hi Aron

        Could you please tell me how to show the “Terminals” view in HANA Studio.

    Thanks a lot in advance.

    Leo Hu

    1. Aron MacDonald Post author

      Hi Leo,

      To access the Linux terminal HANA is running on (presuming you have an appropriate OS-level user ID/password), you can use either PUTTY or, my personal favourite, the 'Remote System Explorer' add-on to HANA Studio.

      I documented the steps for that here:

      Remote System Explorer: Add Secure Shell (SSH) capability to HANA Studio

      In my case I also use it to access my Main node in my Hadoop cluster to do custom Java development and also execute OS level commands.  I like being able to do it all from HANA studio without having to switch between other development tools.

      Kind Regards

      Aron

  14. Ram Lokeswara

    Hello Aron,

    We are using SAP HANA SPS 10, connecting Hadoop using spark controller.

    We are able to connect to Hadoop system. But after creating virtual table, we are unable to see the content. we are getting the below error.

    Could not execute 'SELECT TOP 1000 * FROM "HDPUSER"."spark_demo_products"' SAP DBTech JDBC: [403]: internal error: Error opening the cursor for the remote database for query "SELECT "spark_demo_products"."productid",

    [screenshot: SAP HANA SDA.png]

    Could you please let us know what this error is? Is there any config or authentication step needed to resolve this issue?

    Please help us to fix this one.

    Thanks in advance.

    Best Regards,

    Ram

