
Importing tables from a HANA system into HDFS using Apache Sqoop

This blog discusses a way of importing data from an external SAP system running HANA SP10 into a Red Hat Enterprise Linux (RHEL) system using Apache Sqoop. RHEL is the system the user is currently logged in to via PuTTY, which can be downloaded from the link below:
Download PuTTY – a free SSH and telnet client for Windows

The version of Sqoop used is 1.4.6.2.3.0.0-2557.

What happens when a Sqoop import is run:

1. Sqoop requests metadata for the tables from the RDBMS system. The JDBC API is used to fetch this metadata (a quick connectivity check is sketched after this list).

2. The RDBMS system returns the metadata to the host system, and a .java file is generated for each table to be imported.

3. The .java file is compiled to a .class file and then packaged into a .jar file. The metadata includes the primary key of each table, which Sqoop uses to fetch the row data.

4. The data types used in the tables are then mapped to the nearest compatible data types on the Hadoop side, and the data is imported into the host system.
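Before running a full import, a quick sanity check of the JDBC metadata path is Sqoop's list-tables tool. A minimal sketch, where the host address, port, username and password are placeholders to be replaced with your own values:

# <host address>, <USERNAME> and the port below are placeholders
sqoop list-tables --username <USERNAME> --password **** --connect jdbc:sap://<host address>:30015/ --driver com.sap.db.jdbc.Driver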

 

Internal Commands run during the import:

1. SELECT MIN(primary_key), MAX(primary_key) FROM table_name

2. SELECT * FROM table_name WHERE primary_key BETWEEN min_value AND max_value

The range returned by the first query is divided among the parallel mappers, and each mapper runs the second query on its own sub-range.

 

SQOOP IMPORT

1. The first step is to copy the ngdbc.jar file from the HANA client directory (C:\Program Files\SAP\hdbclient) to the lib folder inside the Sqoop home directory. If you installed Sqoop using Hortonworks, the Sqoop directory would be /usr/hdp/<hdp-version>/sqoop, as in the sketch below.
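A minimal sketch of this step, assuming the jar has already been transferred from the Windows client to /tmp on the RHEL host (for example with pscp or WinSCP) and assuming the HDP 2.3 directory layout; both paths are assumptions, so adjust them to your installation:

# assumed locations: /tmp/ngdbc.jar and the HDP 2.3 Sqoop lib directory
cp /tmp/ngdbc.jar /usr/hdp/2.3.0.0-2557/sqoop/lib/
ls /usr/hdp/2.3.0.0-2557/sqoop/lib/ | grep ngdbc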

 

2. Change to the hdfs user:

su hdfs

where hdfs is the default user of your Hadoop system. A quick check is sketched below.
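A small optional check that the switch worked and that the user's HDFS home directory is reachable (/user/hdfs is the usual default home directory, but it may differ on your cluster):

whoami
# /user/hdfs is the assumed default home directory for the hdfs user
hadoop fs -ls /user/hdfs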

 

3. sqoop import --username <USERNAME> --password **** --connect jdbc:sap://inllxyz12345.abc.global.corp.sap:30015/ --driver com.sap.db.jdbc.Driver --table TABLE_NAME --hive-import --direct

 

The above command imports a table named TABLE_NAME from the default schema ADMIN into HDFS.

inllxyz12345.abc.global.corp.sap:3xx15 is given as the system address, where xx is the HANA instance number (30015 for instance 00).

 

--hive-import -> imports the table into Hive as well as into a default directory in the HDFS system.

The table would be imported to hdfs://inllxyz123.abc.global.corp.sap:8020/user/hdfs/SOMETABLENAME

This can be accessed using the hadoop fs -ls command.
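For example, to list the imported files and peek at the first rows (the path below is the default location mentioned above; part-m-00000 is the usual name of the first mapper's output file):

# SOMETABLENAME and the path are the defaults assumed above
hadoop fs -ls /user/hdfs/SOMETABLENAME
hadoop fs -cat /user/hdfs/SOMETABLENAME/part-m-00000 | head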

 

In case the import did not complete, you will have to delete the files it created and start the import again after resolving the issue. The below command is helpful:

hadoop fs -rm -r hdfs://path/to/file

 

4. The import was required to be performed from a particular schema, so I used SCHEMA_NAME.TABLE_NAME as the --table argument. It gave me the following error:

SQLException in nextKeyValue

Caused by: com.sap.db.jdbc.exceptions.JDBCDriverException: SAP DBTech JDBC: [257]: sql syntax error: incorrect syntax near “.”

Solution: Append the currentschema parameter to the connection URL to make it work, as shown below:

sqoop import --username <USERNAME> --password **** --connect jdbc:sap://<host address>/?currentschema=SCHEMA_NAME --driver com.sap.db.jdbc.Driver --table TABLE_NAME --hive-import --direct

5. To check whether the tables have been imported properly into Hive, open the Hive shell:

> hive

> show tables;

displays the list of tables

> select * from TABLE_NAME;
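To avoid pulling a whole large table to the console, it is safer to look at the schema and only a handful of rows; a small sketch using the same table name as above:

> describe TABLE_NAME;

> select * from TABLE_NAME limit 10;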

Other important points/problems and their solutions:

1. The table name starts with a numeric digit.

Solution: Rename the table so that it contains only letters and does not start with a digit (an alternative that avoids renaming is sketched below).
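If renaming the source table is not an option, a possible workaround (a sketch, not taken from the original post) is Sqoop's free-form query import, which avoids generating a Java class named after the table; "1TABLE", the ID column and the target directory are placeholders:

# "1TABLE", ID and --target-dir are placeholders for illustration
sqoop import --username <USERNAME> --password **** --connect jdbc:sap://<host address>/?currentschema=SCHEMA_NAME --driver com.sap.db.jdbc.Driver --query 'SELECT * FROM "1TABLE" WHERE $CONDITIONS' --split-by ID --target-dir /user/hdfs/TABLE_1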

 

2. The table name contains lowercase letters.

Solution: The table name should contain capital letters only.

 

3. No suitable type conversion is possible for a column from the RDBMS to Hadoop.

Solution: Convert the RDBMS column to a simpler type and check whether the import then succeeds.

 

4. DATE values from the RDBMS system may not come across properly.

Solution: Convert the DATE columns to strings before importing into the Hadoop system; one possible Sqoop-side mapping is sketched below.
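One way to do this without changing the source table (a sketch, not from the original post) is Sqoop's column mapping arguments; CREATED_AT is a hypothetical column name used only for illustration:

# CREATED_AT is a placeholder for a DATE column in TABLE_NAME
sqoop import --username <USERNAME> --password **** --connect jdbc:sap://<host address>/?currentschema=SCHEMA_NAME --driver com.sap.db.jdbc.Driver --table TABLE_NAME --hive-import --map-column-java CREATED_AT=String --map-column-hive CREATED_AT=string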

 

5. In case the table does not have a primary key, use the --split-by argument instead, as below:

sqoop import --username <USERNAME> --password **** --connect jdbc:sap://<host address>/?currentschema=SCHEMA_NAME --driver com.sap.db.jdbc.Driver --table TABLE_NAME --hive-import --split-by ID

Here ID is the column used in place of a primary key to split the data across mappers.

 

6. If you want to perform parallel map operations so that huge tables are imported in comparatively less time, use the "-m n" argument as below:

sqoop import --username <USERNAME> --password **** --connect jdbc:sap://<host address>/?currentschema=SCHEMA_NAME --driver com.sap.db.jdbc.Driver --table TABLE_NAME --hive-import --split-by ID -m 8

Here 8 parallel mappers are run to perform the import, each producing its own output file (see the listing sketched below).
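A quick way to confirm the parallelism is to list the output directory (the path assumes the default import location mentioned earlier); with -m 8 there should be eight files, part-m-00000 through part-m-00007:

# path assumes the default import location used earlier
hadoop fs -ls /user/hdfs/TABLE_NAME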

 

7. When running the import I also faced the below error:

Encountered IOException running import job: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode=”/user/root/.staging”:hdfs:hdfs:drwxr-xr-x

Solution: Change your user to hdfs (the default user for the Hadoop system), or run the single command as the hdfs user, as sketched below.
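If switching users permanently is not desired, one import can be run as the hdfs user instead (a sketch, assuming sudo rights and that sqoop is on the PATH in that context):

# run a single command as hdfs instead of switching users
sudo -u hdfs sqoop import --username <USERNAME> --password **** --connect jdbc:sap://<host address>/?currentschema=SCHEMA_NAME --driver com.sap.db.jdbc.Driver --table TABLE_NAME --hive-import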

 

8. Missing directory error (I don't remember the exact message, but an issue related to a missing directory was faced).

Solution:

sudo -u hdfs hadoop fs -mkdir /user/root

sudo -u hdfs hadoop fs -chown root:root /user/root

 

For any other information related to Sqoop, the user guide below can be referred to:

https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html

      6 Comments

      Former Member

      Great insights Ankit, thanks for the detailed information. It really helped me in setting up and using Sqoop. I had been struggling with this for the past few days.

       

      Can you please share some performance metrics, for example how fast the import is for large tables?

      Ankit . (Blog Post Author)

      You're most welcome, Ashish.
      I will surely share some performance metrics when I import large tables using this. 🙂

      Former Member

      Thanks, but what could be the business case or requirement for importing data from SAP HANA to Hadoop?

      Ankit . (Blog Post Author)

      The Hadoop import was required because our use case needed a huge amount of data to be kept in cold storage that won't be accessed frequently. 🙂

      Former Member

      OK. So can Hadoop be used to store frozen data for any SAP ERP / SAP BW systems as well? Also, is this type of data storage in Hadoop allowed by SAP?

      Former Member

      Ankit,

      I know it is an old thread. I am having trouble running the import.

      I am getting the following error:

      16/08/17 18:29:51 INFO client.RMProxy: Connecting to ResourceManager at ec2-52-52-4-177.us-west-1.compute.amazonaws.com/172.31.1.24:8032

      16/08/17 18:29:53 INFO db.DBInputFormat: Using read commited transaction isolation

      16/08/17 18:29:53 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/ubuntu/.staging/job_1471456795876_0003

      Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

              at org.apache.sqoop.config.ConfigurationHelper.getJobNumMaps(ConfigurationHelper.java:65)

              at com.cloudera.sqoop.config.ConfigurationHelper.getJobNumMaps(ConfigurationHelper.java:36)

              at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:125)

              at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)

              at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)

              at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)

              at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)

              at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)

              at java.security.AccessController.doPrivileged(Native Method)

              at javax.security.auth.Subject.doAs(Subject.java:415)

              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)

              at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)

              at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)

              at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:196)

              at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:169)

              at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:266)

              at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)

              at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)

              at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)

              at org.apache.sqoop.Sqoop.run(Sqoop.java:143)

              at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

              at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)

              at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)

              at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)

              at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

      I copied the ngdbc.jar to the SQOOP_LIB directory. My HANA version is SP10, Apache Hadoop 2.7.1, Sqoop 1.4.6.

      The command I used is:

      /usr/local/sqoop/lib$ $SQOOP_HOME/bin/sqoop import --username SYSTEM -P  --connect jdbc:sap://hostname :30015/ --driver com.sap.db.jdbc.Driver --table TABLE_NAME   --target-dir /user/hana/data/TABLE_NAME

      Any help is appreciated.

      Murali