SAP Data Services & Apache Hadoop
Hello viewers! This article covers the integration of SAP Data Services 4.2 SP 10 and above with Apache Hadoop. From Data Services 4.2 Support Pack 10 onward, we have the flexibility to skip installing the Data Services Job Server on the Hadoop name node: configuring WebHDFS does the job for us!
You might be thinking that this is just another requirement and wondering why it needs a new article. We recently received a demo request from a prospect to integrate SAP Data Services 4.2 SP 10 on a Windows box with Hadoop running on a Linux server. Before Data Services 4.2 Support Pack 10, the answer was straightforward: “Install the SAP Data Services Job Server components on one of the Hadoop nodes”, no matter whether it is on Linux or Windows.
The reason behind this article is that I have not found good content (blog, document, or guide) that helps with configuring this. Here is how it works from the Data Services standpoint. Connectivity can be established via WebHDFS File Locations as below.
WebHDFS should be configured, with its port enabled, on a Hadoop name node. Ideally it would be the same Hadoop system that communicates with Data Services for HDFS and Hive data transfer.
The target file should use a FlatFile format rather than an HDFS format. For more information, please follow the “Supplement for Hadoop” for Data Services 4.2 SP 10 onward.
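Before creating the File Location, it can help to confirm from the Data Services machine that the WebHDFS port is actually reachable. The Python sketch below is one way to do that outside Data Services, using the standard WebHDFS REST URL layout (http://host:port/webhdfs/v1/path?op=...). The host name, port, directory, and user in the usage note are assumptions for illustration, and the sketch assumes simple authentication via user.name rather than Kerberos.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(host: str, port: int, path: str, op: str,
                params: Optional[dict] = None) -> str:
    """Build a WebHDFS REST URL: http://<host>:<port>/webhdfs/v1/<path>?op=<OP>."""
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


def list_directory(host: str, port: int, path: str, user: str) -> list:
    """Return the entry names of an HDFS directory via the LISTSTATUS operation."""
    url = webhdfs_url(host, port, path, "LISTSTATUS", {"user.name": user})
    with urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    # LISTSTATUS answers with {"FileStatuses": {"FileStatus": [ ... ]}}
    return [entry["pathSuffix"] for entry in body["FileStatuses"]["FileStatus"]]
```

For example, list_directory("namenode.example.com", 9870, "/tmp", "hdfs") would list /tmp if WebHDFS is enabled and the port is open (the name node HTTP port defaults to 9870 on Hadoop 3.x and 50070 on Hadoop 2.x). A connection error here usually points at a firewall or at WebHDFS not being enabled, rather than at Data Services.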
Creating New File Locations:
With WebHDFS as communication protocol:
Testing SAP Data Services Read:
A test file has been placed under the hdfs:// location.
The text.txt test file contains the data “helloworld”.
In the test job below, the source file “Location” in the file format points to WebHDFS_File_Location. Manually enter the “File name(s)” while defining the format.
The target file in this test job is in a local directory on the Windows server: C:\Temp. Upon successful job execution, the file or files are transferred to this temporary location.
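To cross-check what the job should have read, the same file can be fetched directly over WebHDFS with the OPEN operation. This is only a sketch for verification outside Data Services, not how Data Services itself performs the read. The host, port, user, and the HDFS path /tmp/text.txt are hypothetical, since the actual hdfs:// path is not shown here.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def open_url(host: str, port: int, path: str, user: str) -> str:
    """Build the WebHDFS URL for reading a file (op=OPEN)."""
    query = urlencode({"op": "OPEN", "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


def read_hdfs_text(host: str, port: int, path: str, user: str,
                   opener=urlopen) -> str:
    """Read a small text file from HDFS and return its content as a string.

    The name node answers OPEN with a redirect to a data node;
    urlopen follows that redirect automatically for GET requests.
    """
    with opener(open_url(host, port, path, user), timeout=10) as resp:
        return resp.read().decode("utf-8")
```

With the test file above, read_hdfs_text("namenode.example.com", 9870, "/tmp/text.txt", "hdfs") should return the same “helloworld” content that lands in C:\Temp. Note that the redirect points at a data node host name, so the Data Services machine must be able to resolve and reach the data nodes as well as the name node.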
Testing SAP Data Services Write:
Consider a pharma company that needs to cleanse its claims data and push it down into its data lake for further processing, for example data exploration by data scientists.
We created a Data Services job to load the end results into a PharmaClaims.csv file and drop it in our target HDFS location. The target format should be a Data Services FlatFile, not HDFS. However, the files will still be placed under the HDFS location on the Hadoop node. The file name in the file format simply carries an ordinary extension such as .csv or .txt, as below.
We use the created File Location as the “Location” in the file format. Upon successful execution, the file has been pushed down to the hdfs:// location as below.
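For readers curious about what a WebHDFS write involves at the protocol level, the sketch below uploads a file with the two-step CREATE operation: a PUT to the name node that returns a 307 redirect, followed by a PUT of the data to the data node named in the Location header. It illustrates the WebHDFS protocol only and makes no claim about Data Services internals. The host, port, user, and target directory /data/claims are assumptions for illustration.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def create_request_path(path: str, user: str) -> str:
    """Build the request path and query for the WebHDFS CREATE operation."""
    query = urlencode({"op": "CREATE", "overwrite": "true", "user.name": user})
    return f"/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


def upload_file(host: str, port: int, path: str, data: bytes, user: str) -> int:
    """Upload bytes to HDFS via WebHDFS; returns the final HTTP status."""
    # Step 1: ask the name node where to write. No file data is sent yet.
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("PUT", create_request_path(path, user))
        resp = conn.getresponse()
        resp.read()
        if resp.status != 307:
            raise RuntimeError(f"expected 307 redirect, got {resp.status}")
        location = resp.getheader("Location")
    finally:
        conn.close()

    # Step 2: send the file content to the data node given in Location.
    target = urlsplit(location)
    conn = http.client.HTTPConnection(target.hostname, target.port, timeout=30)
    try:
        conn.request("PUT", f"{target.path}?{target.query}", body=data)
        resp = conn.getresponse()
        resp.read()
        return resp.status  # the WebHDFS documentation specifies 201 Created
    finally:
        conn.close()
```

A call such as upload_file("namenode.example.com", 9870, "/data/claims/PharmaClaims.csv", open("PharmaClaims.csv", "rb").read(), "hdfs") would mirror what the job above achieves. http.client is used instead of urlopen because it does not follow redirects on its own, which keeps the two steps explicit.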
Lastly, without having the Data Services Job Server on the Hadoop name node, we can now move files to HDFS and Hive! The Supplement for Hadoop discusses the Data Services objects and processes related to accessing your Hadoop account for downloading and uploading data, and the processes for configuring these objects.
For more information, refer to the Supplement for Hadoop below.
Hope this helps fellow experts work with Hadoop and Data Services 4.2 SP 10+. We will be back with another useful topic, so stay tuned for more updates!
Can we connect Apache Hadoop with SAP Cloud Platform Integration (CPI) as a middleware layer as well?
I will have to give a mixed answer. CPI can usually push data into file systems and read from them. I have personally never worked on a scenario like this, CPI with Hadoop.
Is your question related to SAP Data Services or SAP CPI? If it is related to CPI, I would recommend posting it in the CPI forums. I am more than happy to take any Data Services related questions here.
Thanks for the reply !!
We are using SAP CPI for Data Services for integration from cloud to On-Prem SAP and Non SAP systems.
Wanted to understand if we could leverage the same for Hadoop systems as well. I have posted a similar query on the CPI forum, and we are analyzing the suggestions provided by some of the experts.
Thanks for this detailed information...
Is it mandatory to install any other tools, like the Pig client, on the BODS server to connect to WebHDFS or HTTPFS, as mentioned in "HDFS file location objects - SAP Help Portal"?
It says "The machine where the Data Services Job Server is installed has the Pig client installed"... however, the Pig client installation procedure is not mentioned anywhere...
Thanks & Best Rgds,