By David Pugh

Getting started with Data Services & Hadoop

I wanted to learn how the SAP EIM platform, specifically Data Services, integrates with Hadoop. Being a bit of a techie at heart, and not really one for reading manuals, I thought I'd get hands-on experience and decided to install my own virtual machine with both Hadoop and Data Services running on it. I know this somewhat defeats the purpose of Hadoop, with its distributed file system and processing capabilities, but it was the easiest way for me to learn.

I'm no Linux expert, but with my basic knowledge and some help from Google (other search engines are available) I decided to install the Intel distribution on a Linux virtual machine. Intel's distribution uses the Intel Manager framework for managing and provisioning Hadoop clusters, and it is relatively straightforward to get up and running. Once installed, this provides a nice, easy-to-use web interface for installing the Hadoop components such as HDFS, Oozie, MapReduce, Hive etc. These can of course all be installed manually, but that takes time, and using Intel Manager allowed me to provision my first Hadoop cluster (single node) relatively quickly.

A detailed explanation of the different Hadoop components can be found on the Apache Hadoop site.

Once Hadoop was up and running, the next step was to install Data Services. I decided to go with Data Services 4.2, which of course requires the BI Platform, so I went with BI Platform 4.0 SP7 as Data Services 4.2 didn't yet support BI Platform 4.1 at the time. I went with the default installation and used the Sybase SQL Anywhere database, now bundled with the BI Platform install, as the repository for both the CMS and Data Services.

As per the technical manual, Data Services can connect to Apache Hadoop frameworks including HDFS and Hive sources and targets. Data Services must be installed on Linux in order to work with Hadoop. Relevant components of Hadoop include:

HDFS: Hadoop distributed file system. Stores data on nodes, providing very high aggregate bandwidth across the cluster.

Hive: A data warehouse infrastructure that allows SQL-like ad-hoc querying of data (in any format) stored in Hadoop.

Pig: A high-level data-flow language and execution framework for parallel computation that is built on top of Hadoop. Data Services uses Pig scripts to read from and write to HDFS including joins and push-down operations.
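To give a feel for the SQL-like ad-hoc querying that Hive provides, here is a minimal HiveQL sketch. The table, columns and HDFS path are made up purely for illustration and are not part of the demo:

```sql
-- Define an external Hive table over delimited files already sitting in HDFS
-- (hypothetical path and columns, for illustration only).
CREATE EXTERNAL TABLE web_logs (
  log_date   STRING,
  user_id    STRING,
  url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs';

-- Ad-hoc query: Hive compiles this into MapReduce jobs behind the scenes.
SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url;
```

The appeal is that the data stays in HDFS in its original files; the external table is just a schema laid over the top at query time.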

Data Services does not use ODBC to connect to Hive; it has its own adapter, so you must first configure this in the Management Console. The technical manual has all the details, and it is fairly straightforward to configure. Make sure you have all the relevant JAR files listed in the classpath.
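As a rough illustration only (the exact JAR list depends on your Hadoop and Hive versions, so check the technical manual for the definitive set), the adapter's classpath needs to pick up the Hive client and Hadoop core libraries, along these lines:

```shell
# Hypothetical install paths; substitute your actual Hadoop/Hive directories.
export HADOOP_HOME=/usr/lib/hadoop
export HIVE_HOME=/usr/lib/hive

# The Hive adapter classpath should cover the Hive client JARs
# (e.g. hive-exec, hive-metastore, hive-service, libthrift)
# plus the Hadoop core libraries.
CLASSPATH=$HIVE_HOME/lib/*:$HADOOP_HOME/lib/*
```

If the adapter fails to start, a missing or mismatched JAR in this classpath is the first thing to check.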

The next step in my learning with Hadoop and Data Services is to create a demo scenario. Rather than invent one, I'm going to use an existing demo that can be downloaded from the SAP Developer Network and adapt it to work with Hadoop rather than a standard file system and database. I'm going to use the Text Data Processing Blueprints 4.2 Data Quality Management demo, which takes 200 unstructured text files, passes them through the Data Services Entity Extraction transform and loads the results into a target database.

I’m going to put these files into HDFS and read them out using the Data Services HDFS file format, pass the data through the standard demo data flow and then load the data into Hive tables.
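Getting the text files into HDFS in the first place can be done with the standard Hadoop filesystem shell; a sketch along these lines (the local and HDFS directory names are just examples):

```shell
# Create a directory in HDFS for the demo input (example path).
hadoop fs -mkdir /demo/tdp_input

# Copy the 200 unstructured text files from the local filesystem into HDFS.
hadoop fs -put /tmp/blueprint_files/*.txt /demo/tdp_input

# Verify the files landed where expected.
hadoop fs -ls /demo/tdp_input
```

The Data Services HDFS file format then points at that HDFS directory rather than a local path.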

The demo comes with a BusinessObjects Universe and some Web Intelligence reports so time permitting I may port these over to read from Hive as well.

I'll hopefully create my second blog once I've completed this, with my findings and a recorded demo.

      Tammy Powlas

      Congrats on your first blog!

      I recommend you move this from your personal space to Enterprise Information Management space for higher visibility

      David Pugh (Blog Post Author)

      Thanks Tammy. I've moved it.

      Former Member

      Hello David, thanks for the post. I'm trying to connect Data Services to Hive on a Hortonworks Sandbox running on Windows Hyper-V. Would it be possible for us to connect to Hive from Data Services using Hive ODBC drivers?

      David Pugh (Blog Post Author)

      Hi Vinay,

      As far as I'm aware Data Services supports Hadoop / Hive when running on the Linux platform as we have a specific adapter for Hive that doesn't use ODBC.

      Data Services does have a Generic ODBC option though and this could potentially be used to connect to Hive. I haven't tried it though.



      Avinash Verma

      Nice blog David!

      It would be great if you could write a separate technical paper/blog on the integration of Hadoop and Data Services, with all the detailed technical information and configurations.



      Dirk Venken

      Data Services 4.2 does support 4.1 of the BI platform, now. Check out SAP Note 1740516 for all the details.

      Former Member

      Hi David, very good detailed explanation. Can you please provide links to your next blog on connecting to HDFS files using Data Services?