Connecting SAP Data Hub Dev Edition to a Trial Instance of SAP Big Data Services (fka Altiscale)
Since SAP Data Hub 1.3, connectivity to SAP Cloud Platform Big Data Services (SAP Big Data Services) is officially supported.
But what if you don’t yet have a production instance of BDS and want to connect the SAP Data Hub Dev Edition to a trial instance? Then follow this blog for the steps needed.
Thanks to an esteemed but anonymous colleague, I can share how we can easily connect SAP Data Hub Dev Edition to SAP Big Data Services (BDS), formerly known as Altiscale.
In this example we are using the Developer Edition of SAP Data Hub and a Trial Account of SAP Big Data Services (BDS).
For a production deployment the networking links would need to be established by the corresponding network administrators. We will use the HttpFS (WebHDFS) service within BDS.
This configuration could also be applied to other cloud environments that provide SSH access.
Steps required to use the BDS Trial Edition
1. Verify HttpFS Connection within BDS
2. Customise & Build the docker in the Developer Edition
3. Run the Docker
4. Create the Data Hub Pipeline
Big Data Services – Managed Services
BDS provides several managed services for data storage and data processing:
- HIVE Server 2 Service
- HttpFS Service (WebHDFS)
- Spark Controller
- Smart Data Integration (SDI) Service
These can be seen through the BDS portal; we will use the HttpFS Gateway.
If you are using a Trial Account of Big Data Services then you will need to establish an SSH tunnel as the HttpFS gateway is not exposed to the public internet.
We can verify the connection by first connecting to BDS/Altiscale via SSH with the SSH config as described in Using the HttpFS Service in your Big Data Services Cluster. One essential part is the port forwarding, which can be established as below.
LocalForward 24000 httpfs-sap.s3s.altiscale.com:14000
LocalForward 20000 hiveserver-sap.s3s.altiscale.com:10000
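As a quick optional check once the tunnel is open on your local machine, you can call the HttpFS REST endpoint through the forwarded port. The user name i012345 below is the example user used throughout this blog; replace it with your own BDS user.
# Query HttpFS (WebHDFS) through the SSH tunnel from the host machine
curl -i "http://localhost:24000/webhdfs/v1?op=gethomedirectory&user.name=i012345"
# A 200 OK response containing your home directory confirms the tunnel and HttpFS are reachable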
Customise & Build the docker in the Developer Edition
The Data Hub Dev Edition Docker image does not include an SSH client, so we need to add it.
This can be done by editing the Dockerfile and adding “openssh-client” as shown below:
DatahubDevEdition/Dockerfile
# ---------------- Required OS software packages ------------------------------
# These are required OS software package that are needed during runtime of the Dev Edition.
RUN apt-get install -y --no-install-recommends --fix-missing \
curl unzip \
python libnuma-dev libltdl7 libaio-dev python-colorama python-psutil python-protobuf ca-certificates \
openssh-client
In the same file, we also need to copy the ssh config to the container.
Add the following just after the comment “---- SAP Data Hub Developer Edition ----”
RUN mkdir -p /root/.ssh
COPY files_bds/* /root/.ssh/
In the directory containing the Dockerfile, create a new directory called “files_bds”.
Copy into this directory your private key (id_rsa) and the BDS SSH config that was previously used to connect to BDS via SSH.
Host bds
HostName sap.z42.altiscale.com
User i012345
Port 1621
IdentityFile ~/.ssh/id_rsa
Compression yes
ServerAliveInterval 15
TCPKeepAlive yes
AddressFamily inet
DynamicForward localhost:1080
LocalForward 24000 httpfs-sap.s3s.altiscale.com:14000
LocalForward 10000 hiveserver-sap.s3s.altiscale.com:10000
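A minimal sketch of the preparation on the host, assuming the Dockerfile lives in DatahubDevEdition/ and your key and BDS SSH config are in ~/.ssh (adjust the source paths to your setup); the config file must end up named “config” so that ssh inside the container finds it at /root/.ssh/config:
# Create the directory the Dockerfile COPYs from and stage the SSH material
mkdir -p DatahubDevEdition/files_bds
cp ~/.ssh/id_rsa DatahubDevEdition/files_bds/id_rsa
cp ~/.ssh/config DatahubDevEdition/files_bds/config    # the "Host bds" entry shown above
chmod 600 DatahubDevEdition/files_bds/id_rsa
Optionally, adding “StrictHostKeyChecking no” to the bds host entry avoids the interactive host-key prompt the first time ssh runs inside the container.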
Build the Docker image as described at Set up SAP Data Hub, developer edition, Step 3 onwards.
docker build --build-arg VORA_USERNAME=vora --build-arg VORA_PASSWORD=SomeNicePassword19920706 --tag datahub .
I received the following error in a couple of places; it was fixed by adding an apt-get update (and --fix-missing) before the affected apt-get install steps, as shown below for the TensorFlow section.
E: Failed to fetch http://security.debian.org/pool/updates/main/p/perl/perl-modules_5.20.2-3+deb8u10_all.deb 404 Not Found [IP: 217.196.149.233 80]
E: Failed to fetch http://security.debian.org/pool/updates/main/p/perl/perl_5.20.2-3+deb8u10_amd64.deb 404 Not Found [IP: 217.196.149.233 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
# --- Tensorflow ---
RUN apt-get update
RUN apt-get install -y --fix-missing python-pip python-numpy python-scipy python-enum python-pillow
RUN pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp27-none-linux_x86_64.whl
RUN pip install pandas
Once the image builds successfully, you should see output such as below:
Successfully built b998036d2db4
Successfully tagged datahub:latest
Run the Docker Container
We can now run the docker container and test the connectivity. The first time, create the container:
docker run -ti --publish 127.0.0.1:8090:8090 --publish 127.0.0.1:8998:8998 --publish 127.0.0.1:9225:9225 --publish 127.0.0.1:50070:50070 --name datahub --hostname datahub --network dev-net datahub run --agree-to-sap-license --hdfs --livy
Following that we can restart the above container with
docker start -i datahub
We need to establish the SSH tunnel from within the running docker container.
Once we have started the docker instance we can identify our running container.
MacBook-Pro:~ i012345$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c1ade9024522 datahub "/dev-edition.sh run…" About an hour ago Up About an hour 127.0.0.1:8090->8090/tcp, 127.0.0.1:8998->8998/tcp, 127.0.0.1:9225->9225/tcp, 127.0.0.1:50070->50070/tcp, 9099/tcp datahub
Connecting to the running container from the host machine
Run a bash interpreter in the docker container identified by its ID
docker exec -ti c1ade9024522 bash
We are now inside the container and can create the SSH connection:
root@datahub:~/.ssh# ssh bds
Last login: Wed Jun 20 09:04:45 2018 from 10.252.16.219
_ _ _ _
| || | (_) | |
__ _ | || |_ _ ___ ___ __ _ | | ___
/ _` || || _|| |/ __| / __| / _` || | / _ \
| (_| || || |_ | |\__ \| (__ | (_| || || __/
\__,_||_||___||_||___/ \___| \__,_||_| \___|
[i012345@desktop-sap ~]$
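If you prefer not to keep this interactive session open inside the container, a common alternative (standard OpenSSH flags, not specific to BDS) is to start only the port forwards in the background:
# -f: go to background after authentication, -N: do not run a remote command;
# the LocalForward entries from the config are still opened
ssh -f -N bds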
In a separate terminal session we can verify the connection:
MacBook-Pro:~ i012345$ docker exec -ti c1ade9024522 bash
root@datahub:/tmp# curl -i -L "http://localhost:24000/webhdfs/v1?op=gethomedirectory&user.name=i012345"
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: hadoop.auth="u=i012345&p=i012345&t=simple-dt&e=1529523522202&s=cdo0Yh1eiQJd8OM0nR7Jeicwt7M="; Path=/; Expires=Wed, 20-Jun-2018 19:38:42 GMT; HttpOnly
Content-Type: application/json
Transfer-Encoding: chunked
Date: Wed, 20 Jun 2018 09:38:42 GMT
{"Path":"\/user\/i012345"}
All looks good; we can now proceed to build the pipeline.
Create the Data Hub Pipeline
Below I have added the Message Generator, Multiplexer and Wiretap. The only component requiring configuration is the WebHDFS Producer, since localhost:24000 is now forwarded to BDS.
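For reference, the WebHDFS Producer writes through the same HttpFS REST API we queried above. A manual equivalent through the tunnel, using the standard two-step WebHDFS create (sample.txt and the target path are illustrative), would look roughly like this:
# Step 1: ask WebHDFS to create the file; HttpFS answers with a Location header
curl -i -X PUT "http://localhost:24000/webhdfs/v1/user/i012345/sample.txt?op=CREATE&user.name=i012345"
# Step 2: PUT the file content to the URL returned in the Location header of step 1
curl -i -X PUT -T sample.txt "<Location-URL-from-step-1>"
This is only to illustrate the API the producer uses; the pipeline itself handles these calls for you.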
Before we run our pipeline, let’s check the BDS HDFS directory.
[i012345@desktop-sap ~]$ hadoop fs -ls /user/i012345
Found 2 items
drwx------ - i012345 users 0 2018-06-27 18:00 /user/i012345/.Trash
-rw-r--r-- 3 i012345 users 18 2018-06-18 12:20 /user/i012345/temp.txt
After running the pipeline, we can check again
Finally, we can see that the text files generated by the pipeline are now in HDFS inside BDS.
[i012345@desktop-sap ~]$ hadoop fs -ls /user/i012345
Found 16 items
drwx------ - i012345 users 0 2018-06-27 18:00 /user/i012345/.Trash
-rwxr-xr-x 3 i012345 users 153 2018-06-28 09:53 /user/i012345/file_1.txt
-rwxr-xr-x 3 i012345 users 154 2018-06-28 09:54 /user/i012345/file_10.txt
-rwxr-xr-x 3 i012345 users 151 2018-06-28 09:54 /user/i012345/file_11.txt
-rwxr-xr-x 3 i012345 users 154 2018-06-28 09:54 /user/i012345/file_12.txt
-rwxr-xr-x 3 i012345 users 149 2018-06-28 09:54 /user/i012345/file_13.txt
-rwxr-xr-x 3 i012345 users 151 2018-06-28 09:54 /user/i012345/file_14.txt
-rwxr-xr-x 3 i012345 users 151 2018-06-28 09:53 /user/i012345/file_2.txt
-rwxr-xr-x 3 i012345 users 153 2018-06-28 09:53 /user/i012345/file_3.txt
-rwxr-xr-x 3 i012345 users 151 2018-06-28 09:53 /user/i012345/file_4.txt
-rwxr-xr-x 3 i012345 users 149 2018-06-28 09:53 /user/i012345/file_5.txt
-rwxr-xr-x 3 i012345 users 151 2018-06-28 09:53 /user/i012345/file_6.txt
-rwxr-xr-x 3 i012345 users 149 2018-06-28 09:53 /user/i012345/file_7.txt
-rwxr-xr-x 3 i012345 users 153 2018-06-28 09:53 /user/i012345/file_8.txt
-rwxr-xr-x 3 i012345 users 154 2018-06-28 09:54 /user/i012345/file_9.txt
-rw-r--r-- 3 i012345 users 18 2018-06-18 12:20 /user/i012345/temp.txt
[i012345@desktop-sap ~]$
Hi Ian,
Very informative blog,
I am having Data Hub 2.3 on Monsoon and have a BDS sandbox. How do I connect DH to BDS hdfs ? Is this possible ?
Thanks,
Rajendra
Hi Rajendra,
Yes, I am sure this is possible; it is just a matter of connectivity.
BDS is officially supported from Data Hub 2.3
How would you normally connect to your trial instance?
Can you access the HDFS filesystem directly from your Monsoon machine?
If you need to establish an SSH tunnel first, then the process could be similar as you would need to create a custom docker that contains the ability to connect to BDS.