Exporting and Importing DATA to HANA with HADOOP SQOOP
For those who need a rich, SAP-supported tool for managing data flows between SAP and non-SAP systems, the first stop will probably be SAP Data Services, formerly known as BODS.
That said, open-source HADOOP also has its own tool for moving large amounts of data between HADOOP and RDBMS systems, known as SQOOP.
If you need support beyond your own IT organisation then the leading HADOOP vendors (such as Cloudera and Hortonworks) offer support contracts, and will presumably enable you to enhance the tool as required.
Sqoop currently comes in two flavours: Version 1 and Version 2 (which is almost production ready).
Version 1 is a command-line tool, which has been integrated with OOZIE to enable SQOOP to be easily used within a HADOOP workflow.
Version 2 has enhanced security and UI support, but isn’t yet integrated with OOZIE (though that is apparently in the pipeline). As a workaround it can theoretically be used with OOZIE now if wrapped in a shell script or a Java wrapper program.
To demonstrate sqoop I first need to create some test data in HANA:
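A minimal sketch of that setup using hdbsql (the command-line client shipped with the HANA client install). The host, credentials, and column structure here are assumptions for illustration; only the table name HADOOP.HANATEST1 comes from the sqoop commands later in this post:

```shell
# Hypothetical test table -- adjust host, credentials, and columns to your system
hdbsql -n xxx.xxx.xxx.xxx:30015 -u SYSTEM -p xxxyyyy \
  "CREATE COLUMN TABLE HADOOP.HANATEST1 (ID INTEGER PRIMARY KEY, NAME NVARCHAR(50))"
# Insert a couple of rows of test data
hdbsql -n xxx.xxx.xxx.xxx:30015 -u SYSTEM -p xxxyyyy \
  "INSERT INTO HADOOP.HANATEST1 VALUES (1, 'Test Record 1')"
hdbsql -n xxx.xxx.xxx.xxx:30015 -u SYSTEM -p xxxyyyy \
  "INSERT INTO HADOOP.HANATEST1 VALUES (2, 'Test Record 2')"
```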
SQOOP2 (Version 2)
Sqoop2 has a new UI, which has been added to the HADOOP User Experience (HUE) from version 2.5 onwards.
The HUE website (Hue – Hadoop User Experience – The Apache Hadoop UI) has some nice videos demonstrating its features.
NOTE: To use SQOOP2 with HANA you first need to have copied the HANA JDBC driver ngdbc.jar (from the HANA Client download) to the SQOOP2 directory on your HADOOP cluster (e.g. /var/lib/sqoop2)
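The copy itself is a one-liner; a sketch, assuming the jar is in the current directory and the default Cloudera-style service name (both may differ on your cluster):

```shell
# Copy the HANA JDBC driver into the Sqoop2 lib directory
cp ngdbc.jar /var/lib/sqoop2/
# Restart Sqoop2 so it picks up the new driver (service name may vary by distribution)
sudo service sqoop2 restart
```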
At the bottom of the HUE screen capture below you can see I’ve created 2 jobs for IMPORTING and EXPORTING from SAP.
When you create the first job to HANA you will need to create a connection, which can be shared with subsequent jobs:
Add new connection:
With a connection to HANA created, the job can be defined.
First, let’s define and run an IMPORT from SAP to HDFS.
Note: the ‘Extractors’ section enables the data to be extracted in parallel (in this case 5 parallel tasks)
Click Save and Run.
(I’ve skipped the detailed logging screens)
Finally the data is downloaded to HADOOP in 5 separate files (representing the 5 parallel tasks).
One of the task files is:
Don’t worry about the files being split like this, with funny names; HADOOP loves it this way. 🙂
These files can now very easily be used by HADOOP HIVE or PIG etc. for batch processing, OR combined with HANA Smart Data Access to be brought back into HANA as a Virtual Table.
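For example, the imported files can be exposed to HIVE as an external table. This is a sketch: it assumes the hypothetical two-column structure used above and Sqoop's default comma-delimited text output; adjust names and types to your data:

```shell
# Define an external Hive table over the Sqoop output directory, then query it
hive -e "
CREATE EXTERNAL TABLE hanatest1 (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/sqoop2/ht1001';
SELECT COUNT(*) FROM hanatest1;"
```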
Now let’s repeat the process in reverse, loading the data back into a different table in HANA.
NOTE: SQOOP does NOT have complex field mapping rules, so SOURCE and TARGET must have the same column structure.
If you need complex mapping rules then you might be best off using SAP Data Services.
Alternatively you could use HADOOP PIG to first reformat the data into the correct TARGET format, prior to using SQOOP.
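A sketch of that PIG approach, again assuming the hypothetical two-column layout and comma-delimited files; the transformation shown (upper-casing the name) is just a stand-in for whatever reshaping your target table needs:

```shell
# Write a small Pig script that reshapes the Sqoop output before export
cat > reformat.pig <<'EOF'
-- Load the imported files and apply the reformatting the target table expects
src = LOAD '/user/sqoop2/ht1001' USING PigStorage(',') AS (id:int, name:chararray);
out = FOREACH src GENERATE id, UPPER(name) AS name;
STORE out INTO '/user/sqoop2/ht1001_reformatted' USING PigStorage(',');
EOF
pig reformat.pig
```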
Now let’s define and run an EXPORT from HDFS to SAP.
After ‘Save and Run‘ we then have the following results in HANA:
Pretty easy stuff, really.
Once Sqoop2 is officially production ready, it will definitely be worth doing a bit more stress testing with it.
SQOOP1 (Version 1)
Sqoop1 is a command-line tool which should achieve similar results.
The following statements are used:
Import from HANA:
sqoop import --username SYSTEM --password xxxyyyy --connect jdbc:sap://xxx.xxx.xxx.xxx:30015/ --driver com.sap.db.jdbc.Driver --table HADOOP.HANATEST1 --target-dir /user/sqoop2/ht1001 --split-by id
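In Sqoop1 the degree of parallelism is controlled with the -m (--num-mappers) flag alongside --split-by. A sketch, with the same placeholder host and credentials as above:

```shell
# Same import, but with 5 parallel map tasks (the equivalent of 5 'Extractors' in Sqoop2)
sqoop import --username SYSTEM --password xxxyyyy \
  --connect jdbc:sap://xxx.xxx.xxx.xxx:30015/ \
  --driver com.sap.db.jdbc.Driver \
  --table HADOOP.HANATEST1 \
  --target-dir /user/sqoop2/ht1001 \
  --split-by id -m 5
```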
NOTE: I’m not sure yet if it’s just my HADOOP setup, but the Sqoop1 import fails for me with the following error: ‘java.io.IOException: SQLException in nextKeyValue‘. For the moment I’m happy with Sqoop2 for imports, so I’m not that fussed to investigate, but if anyone has the answer then I welcome the feedback.
Export to HANA:
sqoop export -D sqoop.export.records.per.statement=1 --username SYSTEM --password xxxxyyyy --connect jdbc:sap://xxx.xxx.xxx.xxx:30015/ --driver com.sap.db.jdbc.Driver --table HADOOP.HANATEST2 --export-dir /user/admin/HANATEST1_SQOOP1
Sqoop1 Export works for me. The results in HANA were:
NOTE: Sqoop1 and Sqoop2 appear to handle strings slightly differently when exporting and importing so you just need to be careful with your format.
Sqoop1’s advantage over Sqoop2 is that the command line can easily be added to an OOZIE workflow, enabling a full workflow scenario to be processed. For a bit more detail on using OOZIE with HANA, see Creating a HANA Workflow using HADOOP Oozie.
Do give Sqoop a try (whichever flavour) and let me know how you get on. 😀