Arne Weitzel

Connecting SAP DataServices to Hadoop: HDFS vs. Hive

SAP DataServices (DS) supports two ways of accessing data on a Hadoop cluster:

  1. HDFS:
    DS reads HDFS files directly from Hadoop. In DataServices you need to create HDFS file formats in order to use this setup. Depending on your dataflow, DataServices might read the HDFS file directly into the DS engine and handle all further processing in its own engine.

    If your dataflow contains logic that can be pushed down to Hadoop, DS may instead generate a Pig script. The Pig script will then not just read the HDFS file but also handle further transformations, aggregations etc. from your dataflow.

    The latter scenario is usually preferred for large amounts of data, because the Hadoop cluster can then provide the processing power of many nodes on inexpensive commodity hardware. The pushdown of dataflow logic to Pig/Hadoop is similar to the pushdown to relational database systems: it depends on whether the dataflow uses functions that can be processed in the underlying system (and, of course, whether DS knows about all the capabilities of that system). In general, the most common functions, joins and aggregations in DS should be eligible for pushdown to Hadoop.
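
As an illustration of such a pushdown, a Pig script that DS might generate for a dataflow joining and aggregating two HDFS files could look roughly like the following sketch (the file paths, schemas and relation names are made up for this example):

```pig
-- Load two hypothetical CSV files from HDFS
orders    = LOAD '/data/orders.csv'    USING PigStorage(',')
            AS (order_id:int, cust_id:int, amount:double);
customers = LOAD '/data/customers.csv' USING PigStorage(',')
            AS (cust_id:int, region:chararray);

-- Join and aggregate: total order amount per region
joined    = JOIN orders BY cust_id, customers BY cust_id;
by_region = GROUP joined BY customers::region;
totals    = FOREACH by_region GENERATE
            group AS region, SUM(joined.orders::amount) AS total_amount;

-- Write the result back to HDFS
STORE totals INTO '/data/totals_by_region' USING PigStorage(',');
```

The whole script runs as MapReduce jobs inside the cluster; DS only collects the stored result.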

  2. Hive / HCatalog:
    DS reads data from Hive. Hive accesses data that is defined in HCatalog tables. In DataServices you need to set up a datastore that connects to a Hive adapter. The Hive adapter will in turn read tables from Hadoop by accessing a Hive server. DS will generate HiveQL commands. Since HiveQL is similar to SQL, the pushdown of dataflow logic to Hive works much like the pushdown to relational database systems.
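
As a sketch of what such a pushdown might look like, a DS dataflow that aggregates a (hypothetical) HCatalog table sales could be translated into a HiveQL statement along these lines:

```sql
-- Hypothetical pushed-down aggregation; table and column names are made up
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;
```

Hive compiles the statement into MapReduce jobs, so the join/aggregation work stays in the cluster rather than in the DS engine.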

It is difficult to say which approach is better to use in DataServices: HDFS files/Pig or Hive? Both Pig and Hive generate MapReduce jobs in Hadoop, so performance should be similar. Nevertheless, some aspects are worth considering before deciding which way to go. I have tried to describe these aspects in the comparison below. They are probably not complete and they overlap, so in many cases they will not identify a clear favorite. But the comparison may still give some guidance in the decision process.

Setup

  HDFS / Pig: Simple, via HDFS file formats.

  Hive: Not simple. A Hive adapter and a Hive datastore need to be set up. The CLASSPATH setting for the Hive adapter is not trivial to determine, and there are different setup scenarios for DS 4.1 and for DS 4.2.2 or later releases (see also Connecting SAP DataServices to Hadoop Hive).

Use cases

  HDFS / Pig:

  • Native HDFS access is only advisable if all the data in Hadoop necessarily needs to be processed within DataServices, or if the data volume is not too large.
  • Pig covers more of a data flow (in a general sense, not to be confused with DataServices dataflows). A Pig script processes various transformation steps and writes the results into a target table/file. A Pig script may therefore suit DataServices jobs that read, transform and write back data from/to Hadoop.

  Hive: Hive queries are mainly intended for DWH-like workloads. They may suit DataServices jobs that need to join and aggregate large data volumes in Hadoop and write the results into pre-aggregated tables in a DWH or some other datastore accessible to BI tools.

Performance

  HDFS / Pig:

  1. Loading flat files from HDFS without any further processing: faster than Hive.
  2. Mapping, joining or aggregating large amounts of data (presuming the logic gets pushed down to Pig): performance is determined by the MapReduce jobs in Hadoop – therefore similar to Hive.
  3. Mapping, joining or aggregating small amounts of data: the processing might even run faster in the DS engine. It may therefore be an option to force DS not to push the logic down to Pig and instead just read the HDFS file natively.

  Hive:

  1. Loading all data of a table without processing/aggregation: slower than native HDFS access, because of the unnecessary setup of MapReduce jobs in Hadoop.
  2. Mapping, joining or aggregating large amounts of data (presuming the logic gets pushed down to Hive): performance is determined by the MapReduce jobs in Hadoop – therefore similar to Pig.
  3. Mapping, joining or aggregating small amounts of data: there is no way to bypass Hive/HiveQL, so some MapReduce jobs will always be initiated in Hadoop, even for small data volumes. The overhead of initiating these MapReduce jobs takes some time and may outweigh the cost of the data processing itself.

Metadata definition

  HDFS / Pig:

  • HDFS file formats need to be defined manually in order to describe the data structure in the file.
  • On the other hand, an HDFS file format can easily be generated from the Schema Out view in a DataServices transform (in the same way as for local file formats).
  • No data preview is available.

  Hive:

  • HCatalog tables can be imported like database tables. The table structure is already predefined by HCatalog. The HCatalog table might already exist; otherwise it still needs to be specified in Hadoop.
  • Template tables do not work with Hive datastores.
  • From DS version 4.2.3 onwards, data can be previewed in the DS Designer.

Data structure

  In general, HDFS file formats suit unstructured or semi-structured data better. There is little benefit in accessing unstructured data via HiveQL, because Hive will first save the results of a HiveQL query into a directory on the local file system of the DS engine; DS then reads the results from that file for further processing. Reading the data directly via an HDFS file format might be quicker.

Future technologies: Stinger, Impala, Tez

  Some data access technologies will improve the performance of HiveQL or Pig Latin significantly (some already do so today). Most of them will support HiveQL, whereas some will support both HiveQL and Pig. The decision between Hive and Pig might therefore also depend on the (future) data access engines in the given Hadoop environment.

      Former Member

      Hello Arne,

      I am trying to configure BODS and Hadoop. I have read your fantastic blog about that. Actually I am facing problems with the HDFS configuration. In Designer I can create an HDFS file format and I can see the data inside it. When I put this file format into a dataflow I can also see the content of the file with the preview functionality, BUT when I run the job I get an error message:

      (14.2) 08-18-15 12:01:27 (E) (6992:3831240448) RUN-050011: |Data flow DF_HDFS_READ_DEMO|Reader HDFS_0_CONF__AL_ReadFileMT_Read

                                                                 Error: <Failed to initialize HDFS. Check HDFS environment setup>.

      My Environment:

      -BODS 4.2 on Linux Redhat

      -Hadoop 2.6.0 client installed on the BODS server

      The hadoop cluster where I am trying to connect to is a Hortonworks Hadoop cluster.

      The only difference between my installation procedure and your proposal in the blog is that I have installed the hadoop client on the BODS server manually and not via Ambari server.

      I have spent a lot of time trying to solve it, but without success. Do you know what I should check? Do you have some more documentation about that?

      The Hive connector could be configured and it is working well.

      Thank you a lot!


      Arne Weitzel
      Blog Post Author

      Hi Pablo

      it is very difficult for me to guess what might be wrong in your setup.

      I think I would try these steps to narrow down the problem:

      Log in to the Unix server with the system account used by the DS jobserver.

      Source in the DS environment (if not already done in the login profile): $LINK_DIR/bin/, $LINK_DIR/hadoop/bin/ etc.

      1. Try some HDFS commands like  hdfs dfs -ls /
      2. If you don't get any further with the step above, go the hard way and compile and execute a small C test program for HDFS. This will usually print meaningful messages which may help.
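
      The original example is not reproduced here; a minimal connectivity test against the libhdfs C API might look like this sketch (the compile command depends on your Hadoop client installation, e.g. gcc test_hdfs.c -I$HADOOP_HOME/include -L$HADOOP_HOME/lib/native -lhdfs):

```c
/* Minimal libhdfs connectivity test (sketch). Assumes a Hadoop client
 * installation that provides hdfs.h and libhdfs, and that CLASSPATH
 * contains the Hadoop jars. */
#include <stdio.h>
#include "hdfs.h"

int main(void) {
    /* "default" picks up fs.defaultFS from the client configuration */
    hdfsFS fs = hdfsConnect("default", 0);
    if (fs == NULL) {
        fprintf(stderr, "hdfsConnect failed - check HDFS environment setup\n");
        return 1;
    }
    printf("Connected to HDFS\n");
    hdfsDisconnect(fs);
    return 0;
}
```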

      Former Member

      Hi Arne,

      Nice document. We are trying to load a file into HDFS. However we are getting an error like "Error: HDFS Failed to connect to <name node server>".

      Our environment is BODS 4.2 SP7. We are trying to connect to Hortonworks HDP 2.3 and we also installed the HDP package on the BODS server. We installed the HDFS and Pig clients in our BODS environment. We are able to ping the Hadoop cluster and open Hive databases. But we are still getting the error "Error: HDFS Failed to connect to <name node server>".

      Any idea? Appreciate your help.