Technical Articles
Importing tables from a HANA system into HDFS using Apache Sqoop
This blog is to discuss a way of importing data from an external SAP system with HANA SP10 installed to a system with Red Hat Enterprise Linux using Apache Sqoop. RHEL is the system on which the user is currently logged in using Putty. It can be downloaded from the below link:
Download PuTTY – a free SSH and telnet client for Windows
Version of Sqoop used is – 1.4.6.2.3.0.0-2557
What happens when a sqoop import is run:
1. Sqoop raises a request for metadata information from the RDBMS system which has the tables in it. JDBC API is used to get the metadata information.
2. The RDBMS system returns the metadata information to the host system. A .java file is created for each table to be imported.
3. The java file is compiled to a .class file and then to a .jar file. The metadata includes the primary key information for each table using which sqoop tries to get the row information from the tables.
4. Data types used in the tables are then converted to respective nearest compatible data types in the HDFS system and imported in the host system.
Internal Commands run during the import:
1. select min(primary_key), max(primary_key) from table_name.
2. select *from table_name where primary_key between min_value and max_val.
SQOOP IMPORT –
1. First step is to copy the ngdbc.jar file from the HANA client directory(C:\Program Files\SAP\hdbclient) to the lib folder inside SQOOP home directory. If you installed Sqoop using Hortonworks, the sqoop directory would be /usr/hdp/hdp-version/sqoop.
2. Change to hdfs user:
su hdfs
where hdfs is the default used of your hadoop system.
3. sqoop import –username <USERNAME> –password **** –connect jdbc:sap://inllxyz12345.abc.global.corp.sap:30015/ –driver com.sap.db.jdbc.Driver –table TABLE_NAME –hive-import –direct
Above command imports a table named TABLE_NAME from the default schema ADMIN to HDFS.
inllxyz12345.abc.global.corp.sap:3xx15 is given as the system name where xx is the instance number.
–hive-import -> imports the tables into hive as well as a default directory in the HDFS system.
The table would be imported to hdfs://inllxyz123.abc.global.corp.sap:8020/user/hdfs/SOMETABLENAME
This can be accessed using hadoop fs -ls command
In case the import did not complete, you will have to delete the file created and start import again after resolving issues. Below command is helpful –
hadoop dfs -rm -r hdfs://path/to/file
4. The import was required to be performed from a particular schema, I used SCHEMA_NAME.TABLE_NAME as the –table argument. It gave me the following error –
SQLException in nextKeyValue
Caused by: com.sap.db.jdbc.exceptions.JDBCDriverException: SAP DBTech JDBC: [257]: sql syntax error: incorrect syntax near “.”
Solution: Use the argument currentschema in the host address to make it work as shown below –
sqoop import –username <USERNAME> –password **** –connect jdbc:sap://<host address>/?currentschema=SCHEMA_NAME –driver com.sap.db.jdbc.Driver –table TABLE_NAME –hive-import –direct
5. To check if the tables have been moved properly into HIVE, open hive shell:
> hive
> show tables;
displays list of tables
> select * from TABLE_NAME;
Other important points/problems and their solutions:
1. In case table name starts with a numeric digit.
Solution – Change the table name to contain only alphabets.
2. If the table name has small case letters.
Solution – The table name should only have capital letters.
3. If no suitable type conversion was possible for a column from the RDBMS to the HDFS.
Solution – Convert the RDBMS column type to a simpler type and try if the import works successfully.
4. DATE type in the RDBMS systems may not be proper.
Solution – Convert the DATE type to a string before importing into the Hadoop system.
5. In case the table does not have a primary key, use split-by argument instead as below:
sqoop import –username <USERNAME> –password **** –connect jdbc:sap://<host address>/?currentschema=SCHEMA_NAME –driver com.sap.db.jdbc.Driver –table TABLE_NAME –hive-import –split-by ID
Here ID is the column name which is used as primary key.
6. If you want to perform parallel map operations so that huge tables are imported in comparatively less time, use “-m n” argument as below:
sqoop import –username <USERNAME> –password **** –connect jdbc:sap://<host address>/?currentschema=SCHEMA_NAME –driver com.sap.db.jdbc.Driver –table TABLE_NAME –hive-import –split-by ID -m 8
8 parallel threads are run to perform the import.
7. When running the import I also faced below error –
Encountered IOException running import job: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode=”/user/root/.staging”:hdfs:hdfs:drwxr-xr-x
Change your user to hdfs (default user for hadoop system).
8. Missing directory error (don’t remember exactly but some issue related to missing directory was faced)
Solution:
sudo -u hdfs hadoop fs -mkdir /user/root
sudo -u hdfs hadoop fs -chown root:root /user/root
For any other information related to SQOOP, below user guide can be referred –
https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
Great Insights Ankit, Thanks for detailed information It really helped me in setting up and using the sqoop. I was struggling with this from past some days.
Can you please share some performance matrix like when you import large tables how fast it is.
Your most welcome Ashish.
I will surely be sharing some performance matrix when I import large tables using this. 🙂
Thanks ,but what could be the business case or requirement for importing data from SAP HANA to HADOOP ?
HADOOP import was required because our use case required huge amount of data to be kept as a cold storage which won't be accessed frequently. 🙂
ok . So can HADOOP be used to stored the frozen data in HADOOP for any SAP ERP /SAP BW systems also? Also is this type of data storing in HADOOP allowed by SAP?
Ankit ,
i know it is a old thread. I am having trouble running the import .
I am getting the following error
16/08/17 18:29:51 INFO client.RMProxy: Connecting to ResourceManager at ec2-52-52-4-177.us-west-1.compute.amazonaws.com/172.31.1.24:8032
16/08/17 18:29:53 INFO db.DBInputFormat: Using read commited transaction isolation
16/08/17 18:29:53 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/ubuntu/.staging/job_1471456795876_0003
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.sqoop.config.ConfigurationHelper.getJobNumMaps(ConfigurationHelper.java:65)
at com.cloudera.sqoop.config.ConfigurationHelper.getJobNumMaps(ConfigurationHelper.java:36)
at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:196)
at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:169)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:266)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
I copied the ngdbc.jar to SQOOP_LIB directory My HANA verison is SP10. Apache HADOOP 2.7.1 , sqoop 1.4.6
the command i used is
/usr/local/sqoop/lib$ $SQOOP_HOME/bin/sqoop import --username SYSTEM -P --connect jdbc:sap://hostname :30015/ --driver com.sap.db.jdbc.Driver --table TABLE_NAME --target-dir /user/hana/data/TABLE_NAME
Any help is appreciated .
Murali