Learnings from Preparing the Environment for SAP Data Hub in the Google Cloud
Some time ago, my colleague Thorsten Schneider wrote two blogs about SAP Data Hub as a containerized application and about its installation.
In the latter blog, Thorsten remarked that the recommended installation procedure for SAP Data Hub is to use SAP Maintenance Planner instead of the command-line tool (install.sh). This procedure was recently described in another community blog by Stefan Jakobi.
In my two blogs, I want to take one step back and focus instead on the creation of the environment you need to successfully install SAP Data Hub 2.
For setting up my environment, I am going to use Google Cloud Platform. The described setup originated from an SAP-internal test project focused on the initial installation of SAP Data Hub; the requirements were therefore relatively simple compared to a real-life customer project. In our considerations and architectural decisions, however, we tried to think as a potential customer would and took the topic of security especially seriously. On the other hand, you will notice that more advanced topics such as connectivity to external data sources were out of scope for our test.
The experience from our tests was that setting up the installation host and preparing the required infrastructure consumes most of the time, particularly if you do it for the first time. This applies even more if many (or all) of the components involved are new to you: a new cloud provider, Kubernetes as a new infrastructure layer, containers as a new way of packaging and delivering software, and so on.
To be honest with you, it was also a lot of learning for me personally and for the team I am part of. In the end, we decided to share our experiences and considerations with you in the hope of providing a starting point for your own SAP Data Hub project.
Where did we start?
In GCP, users and resources are organized in projects, not in accounts as you may know them from SAP Cloud Platform. For our test activities, we were provided with a new, self-contained GCP project. No resources were predefined, and our team members received users with extensive authorizations in this GCP project, so we could create all required resources ourselves. "Self-contained" also means that we were neither provided with a Shared VPC nor had any sort of connectivity to our internal network with on-premise systems. We had the freedom to define and manage the VPC network for our test setup ourselves.
Kubernetes yes, but how?
With the Google Kubernetes Engine (GKE), Google Cloud provides a managed Kubernetes environment in which your own Kubernetes cluster is just a few clicks away. But before we started clicking, we had to answer some questions to put a setup in place that fulfilled our requirements.
Which Kubernetes version?
The "latest and greatest" is not the right approach in this case! But this question is still easy to answer: you can find the information about supported versions and combinations in the Product Availability Matrix (PAM). Just search there for "SAP Data Hub 2" and study the document with the essential information. In this document, you will find the list of infrastructure platforms and Kubernetes versions supported for your target version of SAP Data Hub.
You will not find any hints in the PAM regarding the version of the Kubernetes command-line tool (kubectl) that you need on the installation host. The Kubernetes documentation, however, contains the following statement:
You must use a kubectl version that is within one minor version difference of your cluster. For example, a v1.2 client should work with v1.1, v1.2, and v1.3 master. Using the latest version of kubectl helps avoid unforeseen issues.
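The skew rule from the quote can be expressed as a small helper. This is a hypothetical sketch (the function names `minor_of` and `skew_ok` are mine, not part of any tool) for checking whether a client version is within one minor version of the cluster:

```shell
# Hypothetical helper functions to check the kubectl/cluster version
# skew rule: the client minor version must be within one of the server's.

# Extract the minor number from a version string like "v1.11.6".
minor_of() {
  echo "$1" | sed 's/^v//' | cut -d. -f2
}

# Succeed (exit 0) if the client is at most one minor version away
# from the server, in either direction.
skew_ok() {
  client=$(minor_of "$1")
  server=$(minor_of "$2")
  diff=$((client - server))
  [ "$diff" -ge -1 ] && [ "$diff" -le 1 ]
}

skew_ok v1.11.6 v1.11.3 && echo "ok"   # same minor version: within the rule
```

You can feed it the values reported by `kubectl version` to double-check your installation host before starting.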
Sizing of Kubernetes Cluster
If you look into the SAP Data Hub documentation, you will find the remark that the required sizing of your Kubernetes cluster depends on the data volume and workload characteristics. In addition, the documentation gives two recommendations for minimum sizing: one for a productive environment and one for a test/development environment.
Some more details were provided in the openSAP course “Freedom of Data with SAP Data Hub”. Just have a look at Week 2 Unit 4 of this course.
For our test project, the minimum sizing with three worker nodes was good enough. And had we unexpectedly extended the scope of our tests and put some decent load on our SAP Data Hub at a later point in time – with Kubernetes, that is not the end of the world. Thanks to the possibility of scaling the resources of a Kubernetes cluster out and up, resizing can be achieved with ease – especially in the cloud!
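To give an idea of how little effort such a resize takes in GKE, here is a hedged sketch; the cluster name, node pool, zone, and node count are placeholders, not values from our project:

```shell
# Sketch: scale a (hypothetical) GKE cluster "datahub-cluster" from the
# minimum three worker nodes up to five. All names and the zone are
# placeholders to be replaced with your own values.
gcloud container clusters resize datahub-cluster \
  --node-pool default-pool \
  --num-nodes 5 \
  --zone europe-west3-a
```

Scaling up the machine type of the nodes is similarly undramatic: you create a new node pool with bigger machines and drain the old one.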
Which Kubernetes “flavor”?
After answering the versioning and sizing questions, we were confronted with the next challenging question: how should we set up a Kubernetes cluster for our test? Or, to put it in different words: which "flavor" of Kubernetes cluster would fit best?
Because we were not setting up a productive instance of SAP Data Hub but had purely the installation procedure in mind, we wanted to be pragmatic on the one hand. On the other hand, we wanted to take the topic of security very seriously, as mentioned at the beginning.
After some internal discussions and consultations with our security experts, we narrowed down our options to a so-called private cluster in Google Kubernetes Engine. GKE private clusters were first released in beta in March 2018, and when we started our test activities this feature was still in beta. In such a private cluster, the worker nodes are created without external IP addresses, which makes them inaccessible from the public Internet. Additionally, you can control access to the master node of a private cluster by explicitly allowing certain IP ranges to communicate with the master.
You will see in the second blog that enabling the private cluster capability is basically nothing more than activating a check-box in the GCP console. But as you will see later, an additional GCP component is required to make SAP Data Hub work after the installation.
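The same check-box has a command-line equivalent. The following is only a sketch under assumed names and CIDR ranges (cluster name, zone, version, and all IP ranges are placeholders), not the exact command we used:

```shell
# Sketch: create a GKE private cluster.
# --enable-private-nodes    removes external IPs from the worker nodes
# --enable-private-endpoint restricts the master to internal addresses
# --master-ipv4-cidr        reserves a range for the managed master
# --enable-ip-alias         (VPC-native networking) is a prerequisite
# All names, the version, and the CIDR ranges are placeholders.
gcloud container clusters create datahub-cluster \
  --zone europe-west3-a \
  --num-nodes 3 \
  --cluster-version 1.11 \
  --enable-ip-alias \
  --enable-private-nodes \
  --enable-private-endpoint \
  --master-ipv4-cidr 172.16.0.16/28
```

With `--enable-private-endpoint`, only hosts inside the VPC network can talk to the master at all – the most restrictive variant.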
Preliminary Version of the Test Environment
In the picture below, I tried to outline the main building blocks of the GCP environment which we used in our first test run for the installation procedure of SAP Data Hub:
- First, we created a VPC network with a subnet, required primary and secondary IP ranges, and a couple of firewall rules.
- In this new VPC network, we created a virtual machine that received an external IP address. This virtual machine became the single point of entry into the test environment and, at the same time, the installation host, with all required tools such as Docker and kubectl installed and configured.
- Our private Kubernetes cluster was created in such a way that the master could only be accessed from internal IP addresses in the same VPC network. This is the most restrictive option for private clusters: even access from the SAP network or from the GCP Cloud Shell was not possible.
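The first two building blocks – the network and the installation host – can be sketched with a few commands. Again, all names, regions, images, and IP ranges below are illustrative placeholders, not the values from our project:

```shell
# Sketch: custom-mode VPC with one subnet. The two secondary ranges are
# later handed to GKE as pod and service IP ranges (VPC-native cluster).
gcloud compute networks create datahub-vpc --subnet-mode=custom
gcloud compute networks subnets create datahub-subnet \
  --network datahub-vpc --region europe-west3 \
  --range 10.0.0.0/24 \
  --secondary-range pods=10.4.0.0/14,services=10.8.0.0/20

# Firewall rule: allow inbound SSH to reach the installation host.
gcloud compute firewall-rules create allow-ssh \
  --network datahub-vpc --allow tcp:22

# The installation host: the only VM with an external IP address.
gcloud compute instances create datahub-jumpbox \
  --zone europe-west3-a \
  --subnet datahub-subnet \
  --image-family ubuntu-1804-lts --image-project ubuntu-os-cloud
```

In a real project you would lock the SSH rule down to known source ranges instead of leaving it open.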
This setup worked great for the installation of SAP Data Hub, so our test was successful! In the last step, we had to prove with the smoke test described in the documentation that the installed SAP Data Hub works as expected.
With the described test environment, the smoke test failed!
SAP Data Hub needed access to an external container image registry to pull a base image for the test pipeline. But this was not possible, because the nodes of our private Kubernetes cluster did not have external IP addresses and therefore could not reach services outside of GCP! How did we solve this issue?
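If you run into the same symptom, the failed pull is visible directly in Kubernetes. A quick way to check (the namespace name is a placeholder; yours depends on your installation):

```shell
# Pods that cannot pull their base image typically show the status
# ErrImagePull or ImagePullBackOff in the STATUS column.
kubectl get pods --namespace datahub

# The Events section at the bottom of the describe output names the
# registry that could not be reached. <pod-name> is a placeholder.
kubectl describe pod <pod-name> --namespace datahub
```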
Enabling Internet Access
The solution for the described issue was NAT – Network Address Translation!
In general, you can create and manage your own NAT gateway. But just in time for our tests, Google Cloud announced the availability of the managed Cloud NAT service. As we did not want to make our test setup overly complicated, we used the managed service.
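Cloud NAT attaches to a Cloud Router in the region of the cluster. As a hedged sketch (router, NAT, network, and region names are placeholders), the setup boils down to two commands:

```shell
# Sketch: a Cloud Router plus a managed Cloud NAT configuration gives
# the private worker nodes outbound Internet access (e.g. for image
# pulls) without assigning them external IP addresses.
gcloud compute routers create datahub-router \
  --network datahub-vpc --region europe-west3
gcloud compute routers nats create datahub-nat \
  --router datahub-router --region europe-west3 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges
```

Note that Cloud NAT only provides outbound connectivity; nothing on the Internet can initiate a connection to the worker nodes.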
Final Test Environment
With the addition of the Cloud NAT service, our test environment looked like in the picture below:
With this environment, we were not only able to successfully install SAP Data Hub in the Google Cloud with the new SL Plugin tool; we could also confirm the successful installation with the smoke test.
Final Words and Next Steps
In this blog, I did not want to give you an exact blueprint for your own SAP Data Hub environment. However, I wanted to give you an idea of some of the questions you have to answer before you can install SAP Data Hub. Some of those decisions can be corrected at a later point in time without any disruption (e.g. the number and size of the worker nodes). But for some corrections you basically have to start from scratch (e.g. the network configuration).
Furthermore, I want to mention that Google and other cloud providers keep rolling out new features and capabilities at a breathtaking pace. It is therefore very important to stay up to date with their developments and roadmaps. It definitely helps you make better decisions.
In my next blog, you will find detailed instructions on how to implement the described environment in Google Cloud.
Best Regards, Roland