Using SAP Data Hub without internet access
It is rare, but sometimes you have to work in an environment with no connectivity to the internet, not even through a secure proxy.
It is entirely possible, but it takes some additional effort. Here are a few tips to make your life easier if you find yourself in this situation. The installation happens in three steps:
- First instruct the installer to download all required images into a folder
- Then transfer everything onto the secure network
- Finally, run the installer with a parameter pointing to the folder with the 41 GB of images
You’ll need additional docker images:
- registry (for insecure docker registry)
- A software binary repository: nexus, artifactory, or any alternative
And you’ll also need additional programs:
- skopeo, to help complete the installation prerequisites
- python3 with setuptools (pip and twine)
After the installation completes successfully, it’s just the beginning!
Running a sample pipeline will fail if the insecure docker registry uses HTTPS. The solution is simply to import it with the “Connection Management” application.
After testing the demo pipelines, it’s time to build your own, and to do so, you’ll probably require external libraries for python, java or node that aren’t available. That’s where the software binary repository comes into play. For instance, to write a custom python operator that connects to a SOAP web service, we need a package called zeep, which brings along its own tree of dependencies.
To use zeep in this offline environment, we need to transfer 15 python packages! Other custom operators will require even more packages, so it is important to have a tool that solves this problem. We will make pip commands point to a local repository manager that provides the required libraries in the correct versions.
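To see why a single library turns into 15 transfers, it helps to walk the dependency tree transitively. The sketch below is illustrative only: the DEPENDENCIES map is a hand-written assumption, loosely reconstructed from the package names pip lists later in this post, and the real edges depend on the versions pip resolves.

```python
from collections import deque

# Hypothetical, hand-written dependency map (an assumption for illustration).
# The real edges depend on the exact versions that pip resolves.
DEPENDENCIES = {
    "zeep": ["appdirs", "attrs", "cached-property", "defusedxml", "isodate",
             "lxml", "pytz", "requests", "requests-toolbelt", "six"],
    "requests": ["certifi", "chardet", "idna", "urllib3"],
    "requests-toolbelt": ["requests"],
}


def transitive_closure(package, dependencies):
    """Return the package plus everything it pulls in, directly or indirectly."""
    seen = {package}
    queue = deque([package])
    while queue:
        current = queue.popleft()
        for dep in dependencies.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)


if __name__ == "__main__":
    needed = transitive_closure("zeep", DEPENDENCIES)
    print(len(needed), "packages to transfer:", ", ".join(needed))
```

In practice pip does this resolution for you; the point is only that every indirect dependency has to exist in the offline repository as well.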
Prepare the offline python repository
We installed the nexus docker image inside OpenShift.
Then we used its HTML administration UI to:
- create a python hosted repository
- grant the browse and upload roles on that repository to the anonymous user
To download all required packages, use the pip command:
pip3 download --only-binary=:all: --python-version 36 --platform manylinux1_x86_64 -d . zeep
Looking in indexes: https://pypi.python.org/simple/
Collecting lxml>=3.1.0 (from zeep)
Downloading https://files.pythonhosted.org/packages/ec/be/5ab8abdd8663c0386ec2dd595a5bc0e23330a0549b8a91e32f38c20845b6/lxml-4.4.1-cp36-cp36m-manylinux1_x86_64.whl (5.8MB)
|████████████████████████████████| 5.8MB 589kB/s Saved ./lxml-4.4.1-cp36-cp36m-manylinux1_x86_64.whl
Successfully downloaded zeep six defusedxml requests-toolbelt pytz isodate appdirs requests lxml attrs cached-property urllib3 certifi chardet idna
Then tar the wheel files and transfer the archive to the secure environment.
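Before transferring, it can be worth checking that every wheel actually matches the target platform, since a wrong file is painful to fix once you are offline. The following is a minimal sketch, assuming the standard wheel filename layout (the last three dash-separated fields are the python, ABI and platform tags) and the cp36 / manylinux1_x86_64 target used in the pip command above. The function names are my own, and the compatibility check is deliberately simplified.

```python
import tarfile
from pathlib import Path


def wheel_tags(filename):
    """Split a wheel filename into (python tag, abi tag, platform tag)."""
    stem = filename[:-len(".whl")]
    # The last three dash-separated fields are the compatibility tags.
    return tuple(stem.split("-")[-3:])


def is_compatible(filename, python_tag="cp36", platform_tag="manylinux1_x86_64"):
    """Simplified check: accept the target tags and generic py3/any wheels."""
    py, _abi, plat = wheel_tags(filename)
    python_ok = any(tag in (python_tag, "py3") for tag in py.split("."))
    platform_ok = any(tag in (platform_tag, "any") for tag in plat.split("."))
    return python_ok and platform_ok


def bundle_wheels(directory, archive_path):
    """Tar the compatible wheels in a directory; return the names skipped."""
    wheels = sorted(Path(directory).glob("*.whl"))
    skipped = [w.name for w in wheels if not is_compatible(w.name)]
    with tarfile.open(archive_path, "w:gz") as archive:
        for wheel in wheels:
            if wheel.name not in skipped:
                # Store only the file name so the archive unpacks flat.
                archive.add(str(wheel), arcname=wheel.name)
    return skipped


if __name__ == "__main__":
    for name in bundle_wheels(".", "wheels.tar.gz"):
        print("skipped (wrong tags):", name)
```

Anything reported as skipped probably needs another pip download run with the right options.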
Load packages in the offline python repository
To upload the wheels into the repository, you need a tool called twine, which is the opposite of pip: it publishes packages instead of installing them. It should be included with the python rpm for your distribution.
twine upload --repository-url <your repo url> *.whl
Enter your username: admin
Enter your password:
Uploading distributions to <your repo url>
100%|█████████████████████████████████████████████████████████| 24.1k/24.1k [00:00<00:00, 382kB/s]
Use those libraries in a custom operator
We make a new docker image that includes one or more additional libraries and will be used in custom operators.
It might be cumbersome to make one docker image for every external package, so we could make one image for all the small packages and one for each big package like tensorflow (100 MB).
FROM $com.sap.opensuse.python36
RUN python3 -m pip config --global set global.index nexus \
    && python3 -m pip config --global set global.index-url your_repo_url \
    && python3 -m pip config --global set global.trusted-host your_repo_host
RUN python3 -m pip install zeep
And off you go!
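One detail that is easy to get wrong is that the trusted host has to match the host of the index URL. As a rough sketch, the two values can be derived from a single URL and rendered as the equivalent [global] section of a pip configuration file. The function name and the example URL are hypothetical, and only the index-url and trusted-host settings are covered here.

```python
import configparser
import io
from urllib.parse import urlparse


def render_pip_conf(index_url):
    """Render a [global] pip config section for a private index URL."""
    host = urlparse(index_url).hostname
    if not host:
        raise ValueError("index URL has no host: %r" % index_url)
    # interpolation=None keeps any '%' characters in the URL untouched.
    config = configparser.ConfigParser(interpolation=None)
    config["global"] = {
        "index-url": index_url,
        # Host name only, derived from the URL so the two cannot drift apart.
        "trusted-host": host,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical repository URL, replace with your own.
    print(render_pip_conf("https://nexus.example.internal/repository/pypi-hosted/simple"))
```

The output could be copied into the image as a configuration file instead of running the pip config commands, if you prefer a file you can review.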
Then we make a graph, add a custom python3 operator, place it in a group and add a tag unique to the previous docker image. (For all practical questions including tagging, I reach out to our champion Henrique Pinto)
And in that operator, the python library zeep is available for use!
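For illustration, here is roughly what the SOAP call inside such an operator script could look like. This is a minimal sketch under assumptions: the helper names, the WSDL URL and the operation name are hypothetical, and the wiring to the operator's input and output ports is left out because it depends on how you defined them.

```python
def make_client(wsdl_url):
    """Create a zeep client for the given WSDL."""
    # Imported lazily so the module still loads where zeep is not installed.
    from zeep import Client
    return Client(wsdl_url)


def call_soap_operation(client, operation, **params):
    """Invoke a SOAP operation by name on a zeep-style client."""
    # zeep exposes each WSDL operation as a callable on client.service.
    service_method = getattr(client.service, operation)
    return service_method(**params)


# Hypothetical usage inside the operator script:
#   client = make_client("http://soap.example.internal/calculator?wsdl")
#   result = call_soap_operation(client, "Add", intA=2, intB=3)
```

Keeping the client creation separate from the call makes it possible to test the operator logic with a stand-in client, which is handy when the real web service is only reachable from the secure network.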