Some Notes on Docker File Creation on SAP Data Intelligence
Introduction
As a data scientist or data engineer you might not have much hands-on experience with Docker. At least that is how I started. I knew about the appealing concept of containerising applications, but when developing pipelines or operators on SAP Data Intelligence I was always happy to have an existing Docker image that I could use. Over time, requests came in that forced me to leave my comfort zone and learn more about Docker. Eventually I realised that working with Docker directly is not as hard as expected, and that the learning curve is short and steep rather than painstakingly long.
In this blog I give a short introduction to Docker from an SAP Data Intelligence angle. I then show first how to add Python packages with pip and second what needs to be done if another package manager is required. Finally I delve into the challenge of adding more elaborate installation tasks to a Dockerfile. For the sake of your nerves and fingernails, these should be developed and tested interactively before building an image on an SAP Data Intelligence instance.
Docker on SAP Data Intelligence
In general you can use any docker image to run on DI. You only have to ensure that it is correctly tagged so that the pipeline scheduler can select the appropriate docker container that provides the libraries required by the operators.
You might run into the challenge of using operators whose tags are not all satisfied by any single existing Docker image, e.g. ‘flowagent’ and ‘python36’. Then you either
- group parts of the pipeline for running them in different docker containers with the caveat of the data volume restriction or
- enhance one of the images with the necessary packages
For performance reasons you might prefer running a pipeline in one container rather than spreading it across multiple ones.
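The tag matching described above can be pictured as a subset test: an image qualifies for a group of operators only if it carries every tag the group requires. The following Python sketch illustrates this simplified model. The image names and tag sets are made up for illustration, versions are ignored, and it is not the actual scheduler logic.

```python
def images_satisfying(required, images):
    """Return the names of all images whose tags cover every required tag."""
    return [name for name, tags in images.items() if required <= tags]


# Hypothetical images and tags, for illustration only.
IMAGES = {
    "image-a": {"flowagent", "sles"},
    "image-b": {"python36", "tornado", "sles"},
}

if __name__ == "__main__":
    # No single image carries both tags, so the result is empty.
    print(images_satisfying({"flowagent", "python36"}, IMAGES))
    # A group that only needs python36 can run on image-b.
    print(images_satisfying({"python36"}, IMAGES))
```

With the two hypothetical images, a group requiring both ‘flowagent’ and ‘python36’ finds no match, which is the situation that calls for either splitting the pipeline into groups or enhancing an image.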
Enhancing Existing Docker Images with pip
SAP has an enterprise support agreement with SUSE and uses SLES as the basis for most of the operators. If you would like, for example, to add Python packages like ‘pandas’, you can select the base image with the reference character ‘$’
FROM $com.sap.sles.base
or directly pull the image from the repository with the reference character ‘§’
FROM §/com.sap.datahub.linuxx86_64/sles:15.0-sap-007
The latter might miss some enhancements that might be added to the Dockerfile in com.sap.sles.base. With that method you can also inherit from non-standard images that have been built and pushed to the local Docker registry from outside of SAP Data Hub / SAP Data Intelligence (on premise). This is often required when it is only allowed to use trusted images that have been hardened according to the company policy. The syntax is as follows: FROM §/<image-name-in-repo>:<version>.
With SAP Data Intelligence 3.0 you are required to run containers as a non-root user. That means you have to add a group and a user to each Docker image:
RUN groupadd -g 1972 cmddata && useradd -g 1972 -u 1972 -m cmddata
USER 1972:1972
WORKDIR "/home/cmddata"
ENV HOME=/home/cmddata
ENV PATH="${PATH}:${HOME}/.local/bin"
In addition I recommend setting some environment variables accordingly, in particular adding the user's ‘.local/bin’ directory to PATH in case binaries are installed there as well.
Finally your new Dockerfile might look like:
FROM $com.sap.sles.base
RUN groupadd -g 1972 cmddata && useradd -g 1972 -u 1972 -m cmddata
USER 1972:1972
WORKDIR "/home/cmddata"
ENV HOME=/home/cmddata
ENV PATH="${PATH}:${HOME}/.local/bin"
RUN python3.6 -m pip --no-cache-dir install 'pandas' --user
RUN python3.6 -m pip --no-cache-dir install 'scikit-learn' --user
Do not forget to add the option ‘--user’ to the pip command to ensure that the package is installed with user authorisations only.
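The Dockerfile above follows a fixed pattern: base image, non-root user, environment variables, then one pip line per package. As a rough sketch, the pattern can be rendered from a package list with a few lines of Python. The function name `render_dockerfile` and its parameters are my own illustration, not part of SAP Data Intelligence.

```python
def render_dockerfile(packages, base="$com.sap.sles.base",
                      user="cmddata", uid=1972):
    """Render a Dockerfile that installs `packages` with pip as a non-root user."""
    home = f"/home/{user}"
    lines = [
        f"FROM {base}",
        f"RUN groupadd -g {uid} {user} && useradd -g {uid} -u {uid} -m {user}",
        f"USER {uid}:{uid}",
        f'WORKDIR "{home}"',
        f"ENV HOME={home}",
        # Plain string on purpose: ${PATH} and ${HOME} are resolved by Docker.
        'ENV PATH="${PATH}:${HOME}/.local/bin"',
    ]
    # --user installs below the user's home directory, where the
    # non-root user is allowed to write.
    lines += [
        f"RUN python3.6 -m pip --no-cache-dir install '{pkg}' --user"
        for pkg in packages
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile(["pandas", "scikit-learn"]), end="")
```

Called with ‘pandas’ and ‘scikit-learn’ it reproduces the lines of the Dockerfile shown above, so the output can be pasted into the Dockerfile editor of the Modeler.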
It is very important that you tag the new Docker image not only with the newly added packages but also with the tags of the base image. There is currently (SAP DI 2.6) no inheritance process in place. In our particular case the tags would look like this:
- default
- sles
- python36
- tornado – 5.0.2
- pandas
- scikit-learn
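Since tags are not inherited, the tag set of the derived image is the union of the base image's tags and the tags for the added packages. The Python sketch below does this bookkeeping. It assumes that the first four entries of the list above come from the base image and that tags are kept as a mapping from name to version. The helper name `merge_tags` and the JSON layout are illustrative assumptions, so check them against the tag file of your own Dockerfile.

```python
import json


def merge_tags(base_tags, added_tags):
    """Return the tags of a derived image: base tags plus newly added ones.

    Both arguments map a tag name to a version string ("" means no version).
    A version given in added_tags overrides the one from the base image.
    """
    merged = dict(base_tags)
    merged.update(added_tags)
    return merged


# Assumed tags of the base image (first four entries of the list above).
BASE_TAGS = {"default": "", "sles": "", "python36": "", "tornado": "5.0.2"}

# Tags describing the packages added by the new Dockerfile.
ADDED_TAGS = {"pandas": "", "scikit-learn": ""}

if __name__ == "__main__":
    print(json.dumps(merge_tags(BASE_TAGS, ADDED_TAGS), indent=4))
```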
Enhancing Existing Docker Images with Other Package Managers
Enhancing the images provided and maintained by SAP has its limitations because you can only use ‘pip’ for installing Python packages. If other package managers such as ‘apt-get’ on Ubuntu or ‘zypper’ on SUSE are necessary, you have to fall back on openly available images.
Fortunately there is already an image that contains the basic packages and can be enhanced as you like. It can be found in the Modeler ->repository/dockerfiles folder with the path:
$com.sap.opensuse.golang.zypper
and the definition:
FROM $com.sap.sles.base
RUN groupadd -g 1972 cmddata && useradd -g 1972 -u 1972 -m cmddata
USER 1972:1972
WORKDIR "/home/cmddata"
ENV HOME=/home/cmddata
ENV PATH="${PATH}:${HOME}/.local/bin"
ARG GOPATH=/gopath
ARG GOROOT=/goroot
ENV GOROOT=${GOROOT}
ENV GOPATH=${GOPATH}
ENV PATH=${GOROOT}/bin:${GOPATH}/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
RUN zypper --non-interactive update && \
# Install tar, gzip, python, python3, pip, pip3, gcc and libgthread
zypper --non-interactive install --no-recommends --force-resolution \
tar \
gzip \
python3 \
python3-pip \
gcc=7 \
gcc-c++=7 \
libgthread-2_0-0=2.54.3 && \
# Install tornado
python3 -m pip --no-cache install tornado==5.0.2 --user
COPY sapgolang.tar.gz /tmp/sapgolang.tar.gz
RUN mkdir -p $GOROOT && \
tar -xzf /tmp/sapgolang.tar.gz --strip-components=1 -C ${GOROOT}
and the tags
- opensuse
- python36
- tornado – 5.0.2
- sapgolang – 1.12.1-bin
- zypper
This base image enables you to run the package manager “zypper” for installing further packages to the image e.g.:
RUN zypper --non-interactive install gcc-fortran
Interactively Creating Dockerfiles
If you need to build more complex Dockerfiles than just adding a couple of simple packages with pip and zypper, you are strongly advised to do so locally first before adding lines to the Dockerfile on an SAP Data Intelligence instance, unless you are an exceptional OS admin and Docker guru. If you belong to the more ordinary kind of developing data scientist or data engineer, a fast trial-and-error approach might be more appropriate. This means you first need to install Docker locally and maybe read about the limited number of commands you are going to use in Dockerfiles. In my opinion a Dockerfile is just an installation batch script that processes the commands listed in it. In the vastness of the internet you will find plenty of good introductory pages on Docker.
In the following I take up a request from a customer in the meteorology business who needed special libraries for writing operators in Python. My first attempt was simply to add the necessary lines to my favourite Docker image ($com.sap.sles.base)
RUN zypper addrepo https://download.opensuse.org/repositories/home:SStepke/openSUSE_Leap_15.0/home:SStepke.repo
RUN zypper refresh
RUN zypper install eccodes
and fell flat on my face. The succinct error message just told me that the build had failed.
So I started my search for enlightenment locally with the base image opensuse/leap:15.0 and the basic extension from the Dockerfile ‘$com.sap.opensuse.golang.zypper’.
Preparation
I created a directory that contains the Dockerfile ‘$com.sap.opensuse.golang.zypper.Dockerfile’ and ‘sapgolang.tar.gz’ because the latter is needed as well.
Then I opened a terminal, went to the above folder and started a build process with
docker build --tag eccodes .
and after some time I could list my images with the command
docker images
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
eccodes latest 44b88839c5b3 44 seconds ago 661MB
opensuse/leap 15.0 7b6c420ec38e 9 days ago 104MB
With
docker images --all
I could see that it was a stacked building process where a lot of child images had been produced.
$ docker images --all
REPOSITORY TAG IMAGE ID CREATED SIZE
eccodes latest 44b88839c5b3 3 minutes ago 661MB
<none> <none> 4dbeed3246d5 3 minutes ago 533MB
<none> <none> ad60d5a15d70 3 minutes ago 530MB
<none> <none> 773c4c187f90 3 minutes ago 526MB
<none> <none> 4744b754f3a7 3 minutes ago 517MB
<none> <none> fbfaf8d1d6c0 11 minutes ago 104MB
<none> <none> 55a13db79639 11 minutes ago 104MB
<none> <none> 7cd45134c515 11 minutes ago 104MB
<none> <none> ab159e9ee696 11 minutes ago 104MB
<none> <none> c7eeb77d5357 11 minutes ago 104MB
opensuse/leap 15.0 7b6c420ec38e 9 days ago 104MB
If the new lines mentioned above for adding the repository and installing the eccodes package are appended, the image build is much faster (the earlier layers are taken from the cache) but in the end fails as well.
But now that I had the image locally, I could run the Docker container interactively with a shell and test all commands step by step.
Step by Step Installation of a new Docker Image
For the step-by-step installation I first needed to run the container interactively
$ docker run -it eccodes bash (or: $ docker run -it eccodes sh)
With this I am in the container and can enter the commands needed for the new Docker image.
1. Command
9b07363dfa92:/ # zypper addrepo https://download.opensuse.org/repositories/home:SStepke/openSUSE_Leap_15.0/home:SStepke.repo
-> No issue
2. Command
9b07363dfa92:/ # zypper refresh
Retrieving repository 'SStepke's Home Project (openSUSE_Leap_15.0)' metadata ---------------------------------------------------------------[\]
New repository or package signing key received:
Repository: SStepke's Home Project (openSUSE_Leap_15.0)
Key Name: home:SStepke OBS Project <home:SStepke@build.opensuse.org>
Key Fingerprint: 02C16E40 E54FD96B 57CBFA85 B1A9061F 7E4A4A2F
Key Created: Tue Nov 6 15:33:51 2018
Key Expires: Thu Jan 14 15:33:51 2021
Rpm Name: gpg-pubkey-7e4a4a2f-5be1b45f
Do you want to reject the key, trust temporarily, or trust always? [r/t/a/?] (r):
This is an interactive command, and the default answer (reject) does not help at all. After some internet research I found the solution: adding the option --gpg-auto-import-keys.
3. Command
9b07363dfa92:/ # zypper --non-interactive install eccodes
ran once the option --non-interactive had been added.
Summary
Here we go. Now I had tested all the commands, and the Docker build ran without complaints once the following 3 lines were added
RUN zypper addrepo https://download.opensuse.org/repositories/home:SStepke/openSUSE_Leap_15.0/home:SStepke.repo
RUN zypper refresh --gpg-auto-import-keys
RUN zypper --non-interactive install eccodes
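Before moving tested commands into the Dockerfile, it can help to replay the whole chain once without a terminal attached, since an image build cannot answer prompts either. The following Python sketch only assembles the corresponding `docker run` command line. The helper names `chain` and `docker_run_argv` are hypothetical, and the image name ‘eccodes’ is the one built locally above.

```python
import shlex


def chain(commands):
    """Join shell commands with && so the chain stops at the first failure."""
    return " && ".join(commands)


def docker_run_argv(image, commands):
    """Build the argv for running the chained commands in a fresh container.

    No -it flags are passed, so no terminal is attached, similar to the
    situation during an image build.
    """
    return ["docker", "run", "--rm", image, "sh", "-c", chain(commands)]


# The three commands tested interactively above.
COMMANDS = [
    "zypper addrepo https://download.opensuse.org/repositories/"
    "home:SStepke/openSUSE_Leap_15.0/home:SStepke.repo",
    "zypper refresh --gpg-auto-import-keys",
    "zypper --non-interactive install eccodes",
]

if __name__ == "__main__":
    # Print a copy-and-paste ready command line for the terminal.
    print(shlex.join(docker_run_argv("eccodes", COMMANDS)))
```

The commands are chained into a single `sh -c` call because every `docker run` starts a fresh container, so a repository added in one call would be gone in the next. The printed line can be pasted into a terminal, or the argv list can be handed to `subprocess.run` on a machine with Docker installed.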
Conclusion
With these learnings I feel prepared to tackle many of the challenges that come up when enhancing Dockerfiles with the pip and zypper package managers. I no longer shy away when asked for more sophisticated tasks such as adding binaries or setting system variables.
Hi Thorsten,
great job, very helpful!
After following your example, I found that two commands might need slight modification.
This command might have missed an argument “.” (docker build . --tag eccodes) when you are executing this within the eccodes-di folder.
“eccodes-di d051079$” might have got into this command line unintended.
Thank you so much and keep on with the outstanding blogs!
Best, Lijin
Very helpful thank you.
Hi Thorsten,
When I create the Docker base using:
FROM $com.sap.sles.base
RUN python3.6 -m pip --no-cache-dir install 'pandas'
RUN python3.6 -m pip --no-cache-dir install 'scikit-learn'
I get
command '/bin/sh -c python3.6 -m pip --no-cache-dir install 'pandas'' returned a non-zero code: 1
Any idea on what could be the problem? Using DI 3.0.19
Thanks
-ravi
Found the answer...with the new security changes designed to prevent running Containers as root, adding a --user will work
FROM $com.sap.sles.base
RUN python3.6 -m pip --no-cache-dir install 'pandas' --user
RUN python3.6 -m pip --no-cache-dir install 'scikit-learn' --user
But thanks for the hint that the blog needs a revision.
I have updated the blog according to release 3.0.
Hi
I am having some problems, such as a "returned a non-zero code" error.
I created the Dockerfile correctly with this code:
FROM $com.sap.sles.base
RUN groupadd -g 1972 cmddata && useradd -g 1972 -u 1972 -m cmddata
USER 1972:1972
WORKDIR "/home/cmddata"
ENV HOME=/home/cmddata
ENV PATH="${PATH}:${HOME}/.local/bin"
RUN python3.6 -m pip --no-cache-dir install 'tensorflow' --user
RUN python3.6 -m pip --no-cache-dir install 'numpy' --user
But when I add the tags to the operator and deploy, I get the non-zero error.
Do you have any idea?
I replaced the commands with these:
RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972
WORKDIR /home/vflow
ENV HOME=/home/vflow
But I get the same result.
I am working with the SAP CAL instance of Data Intelligence.
Thanks,
Best,Sergio
Hi Thorsten,
great info.
Just a quick note: the com.sap.sles.base docker image already includes the vflow user handling logic at the end of its definition, so if you’re referring to it, you don’t need to add it again.
You can just add the pip commands after the reference, e.g.:
Hi Thorsten,
We are trying to install the fbprophet library from Conda in a Dockerfile. Here is the script:
FROM §/com.sap.datahub.linuxx86_64/sles:15.0-sap-020
RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972
WORKDIR /home/vflow
ENV HOME=/home/vflow
RUN conda install -c conda-forge/label/cf201901 'fbprophet' --user
but the Docker image failed to build. We added the default tags to the file. Can you please review the script and tell me what is wrong with it? We are using DI 3.0.
Regards,
Arindom Saha
Hi, Arindom,
have you tested the Dockerfile by creating an image on your own computer? E.g. you could use as a base image:
FROM opensuse/leap:15.1
I suppose some additional libraries are needed when installing fbprophet. I faintly recall that I also had some installation issues when I used this package.
Or conda is not available and you have to use pip instead.
Best,
Thorsten
Hi Thorsten Hapke,
Please help me with the below issue.
Issue
I had to install Python libraries in 2 separate Dockerfiles because they use different docker containers and the DI Modeler did not allow using 2 docker containers in a single Dockerfile. Now when I want to use both Dockerfiles for a group containing a Python operator and assign the tags, I cannot use 2 tags in combination (1 tag for each Dockerfile). The graph throws the error below:
[Screenshot: Modeler - Docker Issue]
I have created 2 Dockerfiles as follows (the second one starts at the second FROM line):
FROM $com.sap.sles.ml.python
USER 1972:1972
WORKDIR /home/vflow
ENV HOME=/home/vflow
RUN python3.6 -m pip --no-cache-dir install --user --upgrade pip
RUN python3.6 -m pip --no-cache-dir install --user fbprophet
RUN python3.6 -m pip --no-cache-dir install --user xgboost
FROM §/com.sap.datahub.linuxx86_64/sles:15.0-sap-020
RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972
WORKDIR /home/vflow
ENV HOME=/home/vflow
RUN python3.6 -m pip --no-cache-dir install 'future' --user
RUN python3.6 -m pip --no-cache-dir install 'sklearn' --user
RUN python3.6 -m pip --no-cache-dir install 'statsmodels' --user
Can you please suggest how to use more than 1 tag to group python operator? It fails for me in every scenario.
Regards
Achin Kimtee
I found that 2 docker files cannot be used as tags while grouping python operator. All the python libraries have to be installed in a single docker file and that tag has to be used.
For my issue, below code worked to install all python libraries in 1 docker file.
Regards
Achin Kimtee
Remark on automatic tag inheritance:
As I see in the SAP Help Portal, the automatic tag inheritance mechanism is already implemented within SAP Data Intelligence Cloud Edition: https://help.sap.com/viewer/1c1341f6911f4da5a35b191b40b426c8/Cloud/en-US/d49a07c5d66c413ab14731adcfc4f6dd.html
So the need to also use the tags from the base image becomes obsolete.
Hope this helps! 🙂
Thanks for the nice article. Best
Johannes
Hi,
followed the blog to create a docker that runs detectron.
Now the new release of DI has arrived with all python3.9.
That left all my dockers crashing including the above.
What would be needed to use python 3.9 in the above docker ?
Any help highly appreciated.
Regards
Marcus
Currently (2209.14.15) we are having 2 python versions (3.6 and 3.9) for the 2 generations of pipeline. For custom Dockerfiles you need to install packages with the command:
For 2nd generation operators (3.9):
or for 1st generation operators (3.6)
Hm, this is not working for me.
Embedding the python3 operator (Gen1) in the above docker gives:
engine com.sap.python36 failed with error: exit status 127: "/vrep/vflow/subengines/com/sap/python36/run.sh: line 3: python3.9:"
It seems that no python3.9 is installed via the docker and the python operator expects this in the latest version of DI.
What might be wrong ?
Any help appreciated.
Regards
Marcus