Ian Henry

Connecting SAP Data Hub Dev Edition to a Trial Instance of SAP Big Data Services (fka Altiscale)

Since SAP Data Hub 1.3, connectivity to SAP Cloud Platform Big Data Services has been officially supported.

But what if you don’t yet have a production instance of BDS and want to connect a trial instance to the SAP Data Hub Dev Edition? Then follow this blog for the steps needed.

Thanks to an esteemed but anonymous colleague, I can share how we can easily connect SAP Data Hub Dev Edition to SAP Big Data Services (BDS), formerly known as Altiscale.

In this example we are using the Developer Edition of SAP Data Hub and a Trial Account of SAP Big Data Services (BDS).

For a production deployment, the network links would need to be established by the corresponding network administrators.  We will use the HttpFS (WebHDFS) service within BDS.

This configuration could also be applied to other cloud environments that provide SSL access.

Steps required to use the BDS Trial Edition

1. Verify HttpFS Connection within BDS

2. Customise & Build the Docker Image in the Developer Edition

3. Run the Docker Container

4. Create the Data Hub Pipeline

Big Data Services – Managed Services

BDS provides several managed services for data storage and data processing:

  • HIVE Server 2 Service
  • HttpFS Service (WebHDFS)
  • Spark Controller
  • Smart Data Integration (SDI) Service

These can be seen through the BDS portal; we will use the HttpFS Gateway.

If you are using a Trial Account of Big Data Services, you will need to establish an SSH tunnel, as the HttpFS gateway is not exposed to the public internet.

We can verify the connection by first connecting to BDS/Altiscale via SSH, using the SSH config as described in Using the HttpFS Service in your Big Data Services Cluster.  One essential part is the port forwarding, which can be established as below:

LocalForward 24000 httpfs-sap.s3s.altiscale.com:14000
LocalForward 20000 hiveserver-sap.s3s.altiscale.com:10000
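With that SSH session open on your laptop, a quick way to confirm the forward is working is to query the HttpFS REST API through the local port. This is only a sketch; the user.name value is an example and should be replaced with your own BDS user.

# Assumes an SSH session using the LocalForward rules above is currently open
curl "http://localhost:24000/webhdfs/v1/?op=LISTSTATUS&user.name=i012345"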

Customise & Build the Docker Image in the Developer Edition

The Data Hub Dev Edition Docker image does not include an SSH client, so we need to add one.

This can be done by editing the Dockerfile and adding “openssh-client” as below:

DatahubDevEdition/Dockerfile
# ---------------- Required OS software packages ------------------------------

#   These are required OS software package that are needed during runtime of the Dev Edition.
RUN apt-get install -y --no-install-recommends --fix-missing \
            curl unzip \
            python libnuma-dev libltdl7 libaio-dev python-colorama python-psutil python-protobuf ca-certificates \
            openssh-client

In the same file, we also need to copy the SSH config into the container.
Add the following just after the comment “—- SAP Data Hub Developer Edition —-”:

RUN mkdir -p /root/.ssh
COPY files_bds/* /root/.ssh/

Inside the directory containing the Dockerfile, create a new directory called “files_bds”.

Copy into this directory your private key (id_rsa) and the BDS SSH config that was previously used to connect to BDS via SSH.

Host bds
       HostName sap.z42.altiscale.com
       User i012345
       Port 1621
       IdentityFile ~/.ssh/id_rsa
       Compression yes
       ServerAliveInterval 15
       TCPKeepAlive yes
       AddressFamily inet
       DynamicForward localhost:1080
       LocalForward 24000 httpfs-sap.s3s.altiscale.com:14000
       LocalForward 10000 hiveserver-sap.s3s.altiscale.com:10000
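For reference, preparing the build context on the host could look like the sketch below. The file names follow this blog; the chmod is an extra precaution I have added, since ssh refuses to use private keys with open permissions.

# Run from the directory containing the Dockerfile
mkdir -p files_bds
cp ~/.ssh/id_rsa files_bds/            # private key registered with BDS
cp ~/.ssh/config files_bds/config      # the BDS SSH config shown above
chmod 600 files_bds/id_rsa             # ssh rejects keys with open permissions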

Build the Docker image as described in Set up SAP Data Hub, developer edition, from Step 3 onwards.

docker build --build-arg VORA_USERNAME=vora --build-arg VORA_PASSWORD=SomeNicePassword19920706 --tag datahub .

I received the following error in a couple of places; it was fixed by adding an apt-get update and the --fix-missing option to the affected apt-get install commands, as in the Tensorflow section shown below.

E: Failed to fetch http://security.debian.org/pool/updates/main/p/perl/perl-modules_5.20.2-3+deb8u10_all.deb 404 Not Found [IP: 217.196.149.233 80]

E: Failed to fetch http://security.debian.org/pool/updates/main/p/perl/perl_5.20.2-3+deb8u10_amd64.deb 404 Not Found [IP: 217.196.149.233 80]

E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
# --- Tensorflow ---
RUN apt-get update
RUN apt-get install -y --fix-missing python-pip python-numpy python-scipy python-enum python-pillow
RUN pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp27-none-linux_x86_64.whl
RUN pip install pandas

Once the image builds successfully, you will see output such as the following:

Successfully built b998036d2db4
Successfully tagged datahub:latest

Run the Docker Container

We can now run the Docker container and test the connectivity. The first time, create the container:

docker run -ti --publish 127.0.0.1:8090:8090 --publish 127.0.0.1:8998:8998 --publish 127.0.0.1:9225:9225 --publish 127.0.0.1:50070:50070 --name datahub --hostname datahub --network dev-net datahub run --agree-to-sap-license --hdfs --livy

Following that, we can restart the container with:

docker start -i datahub

We now need to establish the SSH tunnel from within the running container.

Once the Docker instance has started, we can identify our running container:

MacBook-Pro:~ i012345$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                                                                                NAMES
c1ade9024522        datahub             "/dev-edition.sh run…"   About an hour ago   Up About an hour    127.0.0.1:8090->8090/tcp, 127.0.0.1:8998->8998/tcp, 127.0.0.1:9225->9225/tcp, 127.0.0.1:50070->50070/tcp, 9099/tcp   datahub

Connecting to the running container from the host machine

Run a bash shell in the Docker container, identified here by its ID:

docker exec -ti  c1ade9024522 bash
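Since the container was started with --name datahub, its name can be used instead of the ID:

docker exec -ti datahub bash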

We are now inside the container and can create the SSH connection:

root@datahub:~/.ssh# ssh bds
Last login: Wed Jun 20 09:04:45 2018 from 10.252.16.219
        _  _    _                    _
       | || |  (_)                  | |
  __ _ | || |_  _  ___   ___   __ _ | |  ___
 / _` || ||  _|| |/ __| / __| / _` || | / _ \
| (_| || || |_ | |\__ \| (__ | (_| || ||  __/
 \__,_||_||___||_||___/ \___| \__,_||_| \___|

[i012345@desktop-sap ~]$ 
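Note that the port forwards only exist while this SSH session stays open. If you prefer not to keep an interactive shell, the same tunnel could be opened in the background using standard OpenSSH options, as in this sketch:

# -N: do not run a remote command, -f: go to the background once the tunnel is up
ssh -fN bds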

In a separate terminal session we can verify the connection:

MacBook-Pro:~ i012345$ docker exec -ti c1ade9024522  bash
root@datahub:/tmp# curl -i -L "http://localhost:24000/webhdfs/v1?op=gethomedirectory&user.name=i012345"
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: hadoop.auth="u=i012345&p=i012345&t=simple-dt&e=1529523522202&s=cdo0Yh1eiQJd8OM0nR7Jeicwt7M="; Path=/; Expires=Wed, 20-Jun-2018 19:38:42 GMT; HttpOnly
Content-Type: application/json
Transfer-Encoding: chunked
Date: Wed, 20 Jun 2018 09:38:42 GMT

{"Path":"\/user\/i012345"}

All looks good, so we can now proceed to build the pipeline.

Create the Data Hub Pipeline

Below I have added the Message Generator, Multiplexer and Wiretap operators.  The only component requiring configuration is the WebHDFS producer; localhost:24000 is now forwarded to BDS.

Before we run our pipeline, let’s check the BDS HDFS directory.

[i012345@desktop-sap ~]$ !23
hadoop fs -ls /user/i012345
Found 2 items
drwx------   - i012345 users          0 2018-06-27 18:00 /user/i012345/.Trash
-rw-r--r--   3 i012345 users         18 2018-06-18 12:20 /user/i012345/temp.txt

After running the pipeline, we can check again

Finally, we can see that the text files have been generated in HDFS inside BDS.

[i012345@desktop-sap ~]$ hadoop fs -ls /user/i012345
Found 16 items
drwx------   - i012345 users          0 2018-06-27 18:00 /user/i012345/.Trash
-rwxr-xr-x   3 i012345 users        153 2018-06-28 09:53 /user/i012345/file_1.txt
-rwxr-xr-x   3 i012345 users        154 2018-06-28 09:54 /user/i012345/file_10.txt
-rwxr-xr-x   3 i012345 users        151 2018-06-28 09:54 /user/i012345/file_11.txt
-rwxr-xr-x   3 i012345 users        154 2018-06-28 09:54 /user/i012345/file_12.txt
-rwxr-xr-x   3 i012345 users        149 2018-06-28 09:54 /user/i012345/file_13.txt
-rwxr-xr-x   3 i012345 users        151 2018-06-28 09:54 /user/i012345/file_14.txt
-rwxr-xr-x   3 i012345 users        151 2018-06-28 09:53 /user/i012345/file_2.txt
-rwxr-xr-x   3 i012345 users        153 2018-06-28 09:53 /user/i012345/file_3.txt
-rwxr-xr-x   3 i012345 users        151 2018-06-28 09:53 /user/i012345/file_4.txt
-rwxr-xr-x   3 i012345 users        149 2018-06-28 09:53 /user/i012345/file_5.txt
-rwxr-xr-x   3 i012345 users        151 2018-06-28 09:53 /user/i012345/file_6.txt
-rwxr-xr-x   3 i012345 users        149 2018-06-28 09:53 /user/i012345/file_7.txt
-rwxr-xr-x   3 i012345 users        153 2018-06-28 09:53 /user/i012345/file_8.txt
-rwxr-xr-x   3 i012345 users        154 2018-06-28 09:54 /user/i012345/file_9.txt
-rw-r--r--   3 i012345 users         18 2018-06-18 12:20 /user/i012345/temp.txt
[i012345@desktop-sap ~]$ 
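To peek inside one of the generated files, the Hadoop CLI can be used from the same BDS session (the file name is one of those listed above):

hadoop fs -cat /user/i012345/file_1.txt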

      2 Comments

      Rajendra Chandrasekhar

      Hi Ian,

      Very informative blog,

      I am running Data Hub 2.3 on Monsoon and have a BDS sandbox. How do I connect DH to BDS HDFS? Is this possible?

       

      Thanks,

      Rajendra

      Ian Henry (Blog Post Author)

      Hi Rajendra,
      Yes, I am sure this is possible, it is just a matter of connectivity.
      BDS is officially supported from Data Hub 2.3

      How would you normally connect to your trial instance?
      Can you access the HDFS filesystem directly from your Monsoon machine?
      If you need to establish an SSH tunnel first, then the process would be similar, as you would need to create a custom Docker image that is able to connect to BDS.