
For those who need a rich, SAP-supported tool for managing data flows between SAP and non-SAP systems, the first stop will probably be SAP Data Services, formerly known as BODS.

Here are a couple of useful links:

https://help.sap.com/bods

Getting ready for Big Data with Data Services Webcast – Part 1

Configuring Data Services and Hadoop – Enterprise Information Management – SCN Wiki

That said, open-source Hadoop also has a tool for moving large amounts of data between Hadoop and RDBMS systems, known as Sqoop.

If you need support beyond your own IT organisation then the leading Hadoop vendors (such as Cloudera and Hortonworks) offer support contracts, and will presumably enable you to enhance the tool as required.

Sqoop currently comes in two flavours: Version 1 and Version 2 (which is almost production ready).

Version 1 is a command-line tool, which has been integrated with Oozie so that Sqoop can easily be used within a Hadoop workflow.

Version 2 adds enhanced security and UI support, but isn't yet integrated with Oozie (that is apparently in the pipeline).  As a workaround it can theoretically be used with Oozie now if wrapped in a shell script or a Java program.

To demonstrate Sqoop I first need to create some test data in HANA:
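A minimal sketch of what that test data might look like, inferred from values that appear later in this post (the column names ID, VAL and PRICE are my assumption; an INTEGER ID column matters because it is used below as the split column for parallel extraction):

CREATE COLUMN TABLE HADOOP.HANATEST1 (
    ID INTEGER PRIMARY KEY,
    VAL NVARCHAR(10),
    PRICE DECIMAL(10,2)
);

-- a handful of rows with sequential ids
INSERT INTO HADOOP.HANATEST1 VALUES (1, 'AAA', 1.01);
INSERT INTO HADOOP.HANATEST1 VALUES (2, 'BBB', 2.02);
INSERT INTO HADOOP.HANATEST1 VALUES (3, 'CCC', 3.03);
-- ...and so on up to id 10

-- an empty copy of the table, used later as the EXPORT target
CREATE COLUMN TABLE HADOOP.HANATEST2 LIKE HADOOP.HANATEST1 WITH NO DATA;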





SQOOP2 (Version 2)


Sqoop2 has a new UI, which has been added to the Hadoop User Experience (HUE) from version 2.5 onwards.


The HUE website (Hue – Hadoop User Experience – The Apache Hadoop UI – Tutorials and Examples for Hadoop, HBase, Hive, Impala, Oozie, Pig…) has some nice videos demonstrating its features.


NOTE: To use Sqoop2 with HANA you first need to have copied the HANA JDBC driver ngdbc.jar (from the HANA Client download) to the Sqoop2 directory on your Hadoop cluster (e.g. /var/lib/sqoop2).
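As a rough sketch of that step (exact paths and service names vary by Hadoop distribution):

# copy the HANA JDBC driver (from the HANA Client download) into the Sqoop2 lib directory
cp ngdbc.jar /var/lib/sqoop2/
# make sure the sqoop2 user can read it
chmod 644 /var/lib/sqoop2/ngdbc.jar
# restart the Sqoop2 server so the driver is picked up (service name varies by distribution)
sudo service sqoop2-server restart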


At the bottom of the HUE screen capture below you can see I've created two jobs, one for IMPORTING from SAP and one for EXPORTING to SAP.




When you create the first job against HANA you will need to create a connection, which can be shared with subsequent jobs:


Add new connection:
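Based on the JDBC details used in the Sqoop1 commands later in this post, the connection values are along these lines (the field labels are approximate):

Name:               HANA
JDBC Driver Class:  com.sap.db.jdbc.Driver
Connection String:  jdbc:sap://&lt;hana_host&gt;:30015/
Username:           SYSTEM
Password:           ********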




With a connection to HANA created, the job can be defined.



First let's define and run an IMPORT from SAP to HDFS.

Note: the 'Extractors' setting enables the data to be extracted in parallel (in this case, 5 parallel tasks).


Click Save and Run.


(I’ve skipped the detailed logging screens)



Finally the data is downloaded to Hadoop in 5 separate files (representing the 5 parallel tasks).

One of the task files is:


Don't worry about the files being split up like this with funny names; Hadoop loves it this way. 🙂
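If you want to inspect the output yourself, something along these lines will show it (the target directory here is the one used in the Sqoop1 import command further down):

# list the files created by the 5 parallel extract tasks
hadoop fs -ls /user/sqoop2/ht1001
# view the contents of one of them (substitute a real file name from the listing)
hadoop fs -cat /user/sqoop2/ht1001/&lt;one_of_the_task_files&gt;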


These files can now very easily be used by Hadoop tools such as Hive or Pig for batch processing (a Hive sketch follows below), or combined with HANA Smart Data Access to be brought back into HANA as a virtual table:

Smart Data Access with HADOOP HIVE & IMPALA
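For example, a Hive external table can be laid straight over the import directory. This is a minimal sketch assuming the hypothetical HANATEST1 layout above, comma-delimited text output and the import directory used in the Sqoop1 command further down (see also the string-handling note towards the end of this post):

-- point an external Hive table at the Sqoop output directory
CREATE EXTERNAL TABLE hanatest1_hdfs (
    id INT,
    val STRING,
    price DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/sqoop2/ht1001';

-- the imported files are now queryable in place
SELECT * FROM hanatest1_hdfs;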

Now let's repeat the process in reverse and load the data back into a different HANA table.


NOTE: Sqoop does NOT support complex field mapping rules, so SOURCE and TARGET must have the same column structure.

If you need complex mapping rules then you might be best off using SAP Data Services.

Alternatively you could use Hadoop Pig to first reformat the data into the correct TARGET format, prior to using Sqoop (a Hive version of this idea is sketched below).
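Purely as an illustration of that idea, here is the equivalent reshaping in Hive rather than Pig (my substitution), assuming the hypothetical external table sketched above and a target that only wants two of the columns:

-- write a reshaped, comma-delimited copy of the data, ready for a Sqoop export
CREATE TABLE hanatest1_reshaped
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT id, price FROM hanatest1_hdfs;

The files under the new table's warehouse directory can then be used as the export directory for Sqoop.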


Now let's define and run an EXPORT from HDFS to SAP.



After 'Save and Run' we then have the following results in HANA:




Pretty easy stuff really.


Once Sqoop2 is officially production ready it's definitely worth doing a bit more stress testing with it.



SQOOP1 (Version 1)

Sqoop1 is a command-line tool which should achieve similar results.

The following statements are used:

Import from HANA:

sqoop import --username SYSTEM --password xxxyyyy --connect jdbc:sap://xxx.xxx.xxx.xxx:30015/ --driver com.sap.db.jdbc.Driver --table HADOOP.HANATEST1 --target-dir /user/sqoop2/ht1001 --split-by id

NOTE: I'm not sure if it's just my Hadoop setup, but the Sqoop1 import fails for me with the following error: 'java.io.IOException: SQLException in nextKeyValue'.  For the moment I'm happy with Sqoop2 for imports so I'm not that fussed about investigating, but if anyone has the answer then I welcome the feedback.

Export to HANA:

sqoop export -D sqoop.export.records.per.statement=1 --username SYSTEM --password xxxxyyyy --connect jdbc:sap://xxx.xxx.xxx.xxx:30015/ --driver com.sap.db.jdbc.Driver --table HADOOP.HANATEST2 --export-dir /user/admin/HANATEST1_SQOOP1

NOTE: the -D sqoop.export.records.per.statement=1 property tells Sqoop to generate a separate INSERT statement per record, rather than batching multiple rows into one INSERT.

Sqoop1 Export works for me. The results in HANA were:

NOTE: Sqoop1 and Sqoop2 appear to handle strings slightly differently when exporting and importing, so you just need to be careful with your formats.

Sqoop1's advantage over Sqoop2 is that the command line can easily be added to an Oozie workflow, enabling a full workflow scenario to be processed.  For a bit more detail on using Oozie with HANA, see Creating a HANA Workflow using HADOOP Oozie.

Do give Sqoop a try (whichever flavour) and let me know how you get on. 😀


15 Comments


  1. Justin Molenaur

    Aron, can you shed any light on the throughput performance you have observed with larger data volumes HANA->Sqoop->Hadoop?

    Per this post The SAP HANA Engines? I am looking at a similar topic and seeing some interesting behavior. Have you by chance monitored what HANA is actually doing during these Sqoop reads? My examples are provided using Sqoop v1.4, so I am not sure how the number of parallel threads and connection details play into this too.

    In the table definition window, it looks like you have an option to specify an additional SQL command; any idea if this can be dynamic, i.e. pull the past 60 days of data or similar?

    Regards,

    Justin

    1. Aron MacDonald Post author

      Hi Justin,

      It’s fantastic to hear other people are trying to use Sqoop now too. 🙂

      For details on parallel loads and customizing the 'Where' clause with Sqoop v1.4, please see the Sqoop User Guide (v1.4.3), sections 7.2.3 and 7.2.4.

      Arguably the most important point is choosing the appropriate column on your source table to split the load.

      I think you are restricted to using Numeric columns for the split (e.g. Integer in my example).

      The values should be sequential with minimal skips in the ranges, to help balance the extraction.

      By default it will use the min and max values of the column you define and then split the extraction into ranges according to the number of parallel jobs you define. For example, with ids ranging from 1 to 15 and 5 parallel tasks, each task would be assigned a range of roughly (15 - 1) / 5 ≈ 3 ids.

      I think you can also only run in parallel up to the max number of Mappers you have on your Hadoop cluster.

      Here is an example of what the HANA SQL plan cache looks like in my blog example:


      If I add one extra record:

      insert into HADOOP.HANATEST1 values (15,'OOO',15.15);

      NOTE: I’ve now skipped records with id 11-14


      Now when I run the Sqoop job it splits as follows:


      NOTE: these screen captures are taken using Sqoop2, but I think the logic is similar in 1.4, though the command-line syntax is slightly different.

      I’ve still not tried in anger to get the SQOOP V1.4 Import working, so if you have any details of any special steps you did to get it to work then please let me know.

      Kind Regards

      Aron

  2. Bharath Kumar Bachu

    Hi All,

    I have just started learning Hadoop and am trying the same example of loading HANA data into HDFS using Sqoop. Could somebody help me with installing the driver com.sap.db.jdbc.Driver on a Cloudera machine?

    I have copied the sapdbc.jar to /usr/lib/sqoop/lib but I'm not sure what to do after this step.

    Any help would be highly appreciated.

    Thanks,

    Bharath.

    1. Aron MacDonald Post author

      I didn't try putting files in that directory; see above for the directory I put it in for Sqoop2.

      You may need to check the SQOOP documentation or your own cluster setup for the appropriate path.

      Once you've copied it to the correct location, please check your ngdbc.jar file permissions and ensure you restart the Sqoop service.

      Also, depending on what you are doing, it may make sense to copy it to a shared directory on HDFS, such as /user/admin/shared-jar.

      Cheers

      Aron

      1. Bharath Kumar Bachu

        Hi Aron,

        Thank you very much for the blog and your response. I was able to install the driver and tried to load a table from HANA which has Integer, Nvarchar, Timestamp and Decimal data types.

        When I execute the query I get the following error. Have you ever faced such an issue? Any idea how to fix this?

        Query:

        cloudera@cloudera-vm:/usr/lib/sqoop/bin$ sqoop --options-file HANAconnection.txt --query 'SELECT * FROM AGENT_SCORECARD.ASC_RAW_DATA WHERE YEAR = 2014 AND MONTH =5 AND $CONDITIONS' --split-by AGENT_NUM --target-dir /user/tmp/hana/Agent_Scorecard


        Error:

        14/07/01 09:43:11 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
        14/07/01 09:43:11 INFO tool.CodeGenTool: Beginning code generation
        14/07/01 09:43:11 INFO manager.SqlManager: Executing SQL statement: SELECT * FROM AGENT_SCORECARD.ASC_RAW_DATA WHERE YEAR = 2014 AND MONTH =5 AND (1 = 0)
        14/07/01 09:43:11 INFO manager.SqlManager: Executing SQL statement: SELECT * FROM AGENT_SCORECARD.ASC_RAW_DATA WHERE YEAR = 2014 AND MONTH =5 AND (1 = 0)
        14/07/01 09:43:11 ERROR orm.ClassWriter: Cannot resolve SQL type -9
        14/07/01 09:43:11 ERROR orm.ClassWriter: No Java type for SQL type -9
        14/07/01 09:43:11 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException
        java.lang.NullPointerException
                at com.cloudera.sqoop.orm.ClassWriter.parseNullVal(ClassWriter.java:745)
                at com.cloudera.sqoop.orm.ClassWriter.parseColumn(ClassWriter.java:770)
                at com.cloudera.sqoop.orm.ClassWriter.generateParser(ClassWriter.java:844)
                at com.cloudera.sqoop.orm.ClassWriter.generateClassForColumns(ClassWriter.java:1125)
                at com.cloudera.sqoop.orm.ClassWriter.generate(ClassWriter.java:990)
                at com.cloudera.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
                at com.cloudera.sqoop.tool.ImportTool.importTable(ImportTool.java:337)
                at com.cloudera.sqoop.tool.ImportTool.run(ImportTool.java:423)
                at com.cloudera.sqoop.Sqoop.run(Sqoop.java:144)
                at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
                at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
                at com.cloudera.sqoop.Sqoop.runSqoop(Sqoop.java:180)
                at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:218)
                at com.cloudera.sqoop.Sqoop.main(Sqoop.java:228)

        1. Aron MacDonald Post author

          Hi, 

          I’ve still not tried sqoop version 1 for importing to HADOOP from HANA.

          However, with Sqoop you will have to be more careful about dealing with SQL types.

          Have you successfully imported a simple HANA table yet, without  using splitting?

          If you haven’t already I suggest you start small and gradually increase the scope of columns/data types and rows you are extracting from HANA.

          If you eventually find that a particular SQL data type is causing the error with Sqoop1, then please add a comment below and I will take a look and see if I can suggest a workaround.

          Cheers

          Aron

  3. Blair Krotenko

    Hi Aron,

    Thanks for the post.

    I had to make a couple of changes to get this to work in my test system. I copied the JDBC driver to /usr/lib/sqoop/lib/ as per:

    Installing the JDBC Drivers

    Also, I had to use a partition column in the step "First let's define and run an IMPORT from SAP to HDFS". The column specified has to be an Integer data type in the source table. I kept getting "GENERIC_JDBC_CONNECTOR: The type is not supported - 9" until I figured this out.

    Thanks,

    Blair

  4. Martin Maruskin

    Hi Aron,

    Do I understand it correctly that Sqoop leverages the HANA JDBC connector while interacting with HANA? Can you please confirm my assumption?

    Also, what is your experience with the performance of Sqoop when loading data into HANA?

    thanks

    m./

  5. Ankit .

    Hi Aron,

     Firstly, thank you so much for this document on Sqoop; it helped me a lot. I did not have the option to use Sqoop2, so I tried the Sqoop1 method and was finally successful. During a Sqoop import from a particular schema, I faced the same error that you mentioned in the blog. The error in my case was caused by "incorrect syntax near '.'". When giving the argument, I initially gave it as SCHEMANAME.TABLENAME and then changed it to TABLENAME.

    To pass the schema, I modified the connect argument to:

    --connect jdbc:sap://inll50838206a.apj.global.corp.sap:30015/?currentschema=SCHEMANAME

    Somehow sqoop import was not able to parse the "." between schema and table name, and it worked properly after the above changes.

    The final import command that I used was something like this:

    sqoop import --username ADMIN --password **** --connect jdbc:sap://inll******.apj.global.corp.sap:30015/?currentschema=SCHEMANAME --driver com.sap.db.jdbc.Driver --table TABLENAME --hive-import --direct

    Giving --hive-import is optional, depending on your requirement.

    Thanks,

    Ankit

  6. manda mahesh

    Hi Aron,

    I'm Mahesh, working at TCS, and in my current project we have to integrate SAP HANA and Hadoop. I don't know much about how to integrate Hadoop with HANA, so please send me any useful documents as soon as possible.

    Thanks

    mahesh

  7. Bhaskar N

    Hi Aron MacDonald,

    Thank you for your blog, nice explanation.

    I have done as per your blog; I am able to connect to the database but not able to fetch the table. I am getting a JDBC connection timed out error.


    Can you please help me from here.

    Thank you in advance,

    Bhaskar N.

  8. Edwin Jerome

    Hi Aron,

     I have a question regarding import from HANA. By the way, I'm new to HANA but familiar with Hadoop; I've already done ETL using Sqoop from Oracle, MySQL and SQL Server.

     Here I'm trying to connect HANA with Hadoop using Sqoop. I am able to connect: I can list tables, create tables, run select queries and so on. But when I try to fetch the data from HANA, I face a timeout problem as follows:

    Caused by: com.sap.db.jdbc.exceptions.JDBCDriverException: SAP DBTech JDBC: Cannot connect to jdbc:sap://XXX.XXX.XX.XX:30015/HDM [Cannot connect to host XXX.XXX.XX.XX:30015 [Connection timed out], -813.].


     Also, the MapReduce process is not starting. I got a suggestion to increase MAXUSERTASKS to some value (from the link: Cannot connect to jdbc:sapdb Cannot connect to host localhost:7210, Cannot connect to jdbc:sapdb://<host>/<DB> [Time out]), but I don't know how to increase the MAXUSERTASKS parameter, or whether I should do something else. Can you help me with this?

     I've been facing this problem for a long time and I want to solve it. Please suggest how to resolve this issue.

