Getting started with Data Services & Hadoop
I wanted to learn how the SAP EIM platform, specifically Data Services, integrates with Hadoop. Being a bit of a techie at heart, and not really one for reading manuals, I thought I'd get hands-on experience and decided to install my own virtual machine with both Hadoop and Data Services running on it. I know this sort of defeats the purpose of Hadoop with its distributed file system and processing capabilities, but it was the easiest way for me to learn.
I’m no Linux expert, but with my basic knowledge and some help from Google (other search engines are available) I decided to install the Intel distribution on a Linux virtual machine. Intel utilises the Intel Manager framework for managing and deploying Hadoop clusters, and it is relatively straightforward to get up and running. Once installed, this provides a nice, easy-to-use web interface for installing the Hadoop components such as HDFS, Oozie, MapReduce, Hive etc. These can of course all be installed manually, but that takes time, and using Intel Manager allowed me to provision my first Hadoop cluster (single node) relatively quickly.
A detailed explanation of the different Hadoop components can be found on the Apache Hadoop site – http://hadoop.apache.org/
Once Hadoop was up and running, the next step was to install Data Services. I decided to go with Data Services 4.2, which of course requires the BI Platform; I went with 4.0 SP7, as Data Services 4.2 doesn’t yet support BI Platform 4.1. I went with the default installation and used the Sybase SQL Anywhere database, now bundled with the BI Platform install, as the repository for both the CMS and Data Services.
As per the technical manual, Data Services can connect to Apache Hadoop frameworks including HDFS and Hive sources and targets. Data Services must be installed on Linux in order to work with Hadoop. Relevant components of Hadoop include:
HDFS: Hadoop distributed file system. Stores data on nodes, providing very high aggregate bandwidth across the cluster.
Hive: A data warehouse infrastructure that allows SQL-like ad-hoc querying of data (in any format) stored in Hadoop.
Pig: A high-level data-flow language and execution framework for parallel computation that is built on top of Hadoop. Data Services uses Pig scripts to read from and write to HDFS including joins and push-down operations.
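Before pointing Data Services at the cluster, it’s worth sanity-checking HDFS and Hive from the command line. The following is a minimal sketch using the standard Hadoop and Hive CLIs; the paths are illustrative, not from the actual setup:

```shell
# List the HDFS root and create a working directory (path is illustrative)
hadoop fs -ls /
hadoop fs -mkdir -p /user/demo

# Copy a local file into HDFS and read it back
hadoop fs -put /tmp/sample.txt /user/demo/
hadoop fs -cat /user/demo/sample.txt

# Run an ad-hoc HiveQL statement from the shell to confirm Hive is alive
hive -e "SHOW TABLES;"
```

If all of these succeed, the single-node cluster is in a usable state for the Data Services connectivity work that follows.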
Data Services does not use ODBC to connect to Hive; it has its own adapter, so you must first configure this in the Management Console. The technical manual has all the details, and it is fairly straightforward to configure. Make sure you have all the relevant JAR files listed in the classpath.
The next step in my learning with Hadoop and Data Services is to create a demo scenario. Rather than invent one, I’m going to use an existing demo that can be downloaded from the SAP Developer Network and adapt it to work with Hadoop rather than a standard file system and database. I’m going to use the Text Data Processing Blueprints 4.2 Data Quality Management demo, which takes 200 unstructured text files, passes them through the Data Services Entity Extraction transform and loads the results into a target database.
I’m going to put these files into HDFS and read them out using the Data Services HDFS file format, pass the data through the standard demo data flow and then load the data into Hive tables.
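Staging the blueprint’s input files in HDFS is a straightforward upload with the Hadoop filesystem shell. A sketch of what I have in mind, where the local source path and the HDFS target directory are both my own illustrative choices:

```shell
# Hypothetical local directory holding the blueprint's 200 text files
SRC=/opt/sap/blueprints/tdp_dqm/input

# Create a target directory in HDFS and upload the files
hadoop fs -mkdir -p /user/ds_demo/input
hadoop fs -put "$SRC"/*.txt /user/ds_demo/input/

# Confirm the files arrived
hadoop fs -ls /user/ds_demo/input
```

The Data Services HDFS file format would then be pointed at /user/ds_demo/input (or wherever the files actually land) as its source directory.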
The demo comes with a BusinessObjects Universe and some Web Intelligence reports so time permitting I may port these over to read from Hive as well.
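For the Hive side, the target would just be a table that the adapted data flow loads and the reports query. A rough sketch via the Hive CLI; the table name and columns are hypothetical placeholders, not the blueprint’s actual schema:

```shell
# Create a hypothetical target table for the extracted entities
hive -e "
CREATE TABLE IF NOT EXISTS entity_results (
  doc_name    STRING,
  entity      STRING,
  entity_type STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
"

# The kind of summary query a ported WebI report might run against Hive
hive -e "SELECT entity_type, COUNT(*) FROM entity_results GROUP BY entity_type;"
```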
I’ll hopefully create my second blog post once I’ve completed this, with my findings and a recorded demo.