Setting Up the Environment for SAP Data Hub on Google Cloud Platform (GCP)
If you read my first blog, you are aware of the considerations we had to make on our way to a successful installation of SAP Data Hub in a Kubernetes cluster on Google Cloud Platform. In this blog, I want to give you a concrete example of how we put those considerations into practice to deploy and configure the GCP resources required for a successful installation of SAP Data Hub.
If you have never worked with GCP, this blog might feel like a GCP crash course. I promise to keep it as simple as possible and, where necessary, to provide links for further reading.
- You need a GCP project
- Your user in this GCP project needs quite extensive authorizations to be able to create all required resources. I work with the role “Project Editor”.
Before you start
In the steps below, I am going to use the Google Cloud SDK to create different resources in my GCP project. You can either install it locally or use the Cloud Shell in the GCP Console. With the command gcloud config list you can check the settings of the active configuration of the Cloud SDK.
If they are not yet set, set the project ID, the default compute zone, and the default compute region with the following commands:
gcloud config set project [GCP_PROJECT_ID]
gcloud config set compute/zone [COMPUTE_ZONE]
gcloud config set compute/region [COMPUTE_REGION]
In this blog, I use the compute region europe-west3 and the compute zone europe-west3-a. You should change these settings according to your needs.
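A zone name is the region name plus a short suffix, so the region can be derived from the zone instead of being typed twice. This is a minimal sketch; the variable names COMPUTE_ZONE and COMPUTE_REGION are my own and not something gcloud requires.

```shell
# Derive the compute region from the zone by stripping the trailing "-<suffix>".
# COMPUTE_ZONE and COMPUTE_REGION are illustrative variable names.
COMPUTE_ZONE="europe-west3-a"
COMPUTE_REGION="${COMPUTE_ZONE%-*}"
echo "$COMPUTE_REGION"   # prints europe-west3
```

Both values can then be passed to the gcloud config set commands shown above.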
Set Up the Environment
Create a Custom Network
First, I create a new dedicated network named datahub-network:
gcloud compute networks create datahub-network --subnet-mode=custom
The flag --subnet-mode=custom specifies that I want to create subnets manually (see the gcloud reference for more options).
In the new network, I create the subnet datahub-subnet-1 with a primary range and two secondary address ranges, datahub-svc-range and datahub-pod-range:
gcloud compute networks subnets create datahub-subnet-1 \
  --network datahub-network --range 10.0.4.0/22 \
  --enable-private-ip-google-access --region europe-west3 \
  --secondary-range datahub-svc-range=10.0.32.0/20,datahub-pod-range=10.4.0.0/14
- Primary range (10.0.4.0/22) is meant for Kubernetes cluster nodes,
- Secondary range (datahub-svc-range=10.0.32.0/20) is for the Service IP addresses, and
- Secondary range (datahub-pod-range=10.4.0.0/14) is for Pod IP addresses
- With the flag --enable-private-ip-google-access, I enable access to Google Cloud APIs for instances without a public IP address located in this subnet
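To get a feeling for how much room each range provides, the prefix length can be turned into an address count. The helper below is a small sketch (cidr_size is a name I made up); it only does the arithmetic and does not talk to GCP. Note that Google Cloud reserves a few addresses in the primary range of every subnet, so the usable number of node addresses is slightly lower than the raw count.

```shell
# Number of IPv4 addresses in a CIDR block with the given prefix length.
# cidr_size is a hypothetical helper, not a gcloud command.
cidr_size() {
  echo $(( 1 << (32 - $1) ))
}

cidr_size 22   # primary range for nodes:      1024
cidr_size 20   # secondary range for Services: 4096
cidr_size 14   # secondary range for Pods:     262144
```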
Set Up Cloud NAT
First, I reserve a static external IP address. It is possible to have the IP addresses for the NAT gateway allocated automatically. Those addresses could change, however, which might require adapting the firewall rules of external services. To avoid this, I reserve a static external IP address for the NAT gateway with the following command:
gcloud compute addresses create datahub-nat-gateway-ip --region europe-west3
I need to create the Cloud Router in the same region where the Kubernetes cluster is going to be created later (i.e. europe-west3):
gcloud compute routers create datahub-nat-router \
  --network datahub-network --region europe-west3
To add a NAT configuration to the Cloud Router, I use the following command:
gcloud compute routers nats create datahub-nat-config \
  --router-region europe-west3 --router datahub-nat-router \
  --nat-all-subnet-ip-ranges --nat-external-ip-pool=datahub-nat-gateway-ip
You can find more details on Cloud NAT in the GCP documentation.
Create the Installation Host
The next step is the creation of the most important component in the installation process of SAP Data Hub: the installation host.
To prevent the installation host from changing its external IP address after a possible restart, I first reserve a dedicated static IP address:
gcloud compute addresses create inst-host-ip --region europe-west3
From the previously created GCP resources, I use the custom network and the reserved static IP address when I create a new VM instance named inst-host:
gcloud beta compute --project=[GCP_PROJECT_ID] instances create inst-host \
  --zone=europe-west3-a --machine-type=n1-standard-1 --subnet=datahub-subnet-1 \
  --address=inst-host-ip --network-tier=PREMIUM \
  --metadata=block-project-ssh-keys=true --maintenance-policy=MIGRATE \
  --service-account=[Compute Engine default service account] \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --tags=inst-host-vm \
  --image=sles-12-sp4-v20181212 --image-project=suse-cloud \
  --boot-disk-size=100GB --boot-disk-type=pd-standard \
  --boot-disk-device-name=inst-host
I am not going to explain every option used in the command above. See the gcloud reference for more details. Here are the most important ones:
- An important option in this command is --tags=inst-host-vm. It helps me later to apply firewall rules specifically to this VM instance.
- With the option --subnet, I specify that the installation host is part of the subnet datahub-subnet-1
- The option --scopes describes the permissions of the instance. In my case, I selected the full access scope. More about the access scopes in the GCP documentation.
- With the option --metadata=block-project-ssh-keys=true, I allow only the instance-specific SSH keys and block the inheritance from the project. In my opinion, it is a better option to allow access to this critical piece of the landscape only to a few dedicated users.
Create Required Firewall Rules
You will notice the following message when you create the custom network:
Instances on this network will not be reachable until firewall rules are created.
Therefore, I need this step to explicitly allow the SSH communication to my installation host. And it is going to be the only host which I can reach from the outside!
gcloud compute firewall-rules create insthost-allow-ssh \
  --network datahub-network --source-ranges [SOURCE_IP_RANGES] \
  --target-tags inst-host-vm --allow tcp:22
With this command you create a new firewall rule named insthost-allow-ssh with the following configuration:
- It applies to the network datahub-network created earlier (--network datahub-network)
- It allows incoming traffic only from IP addresses with the defined SOURCE_IP_RANGES, e.g. from inside your company’s network (--source-ranges [SOURCE_IP_RANGES])
- It applies to instances with the network tag inst-host-vm (--target-tags inst-host-vm). This tag was used in the gcloud command for the installation host creation in the previous step.
- It only allows the TCP traffic on port 22 (--allow tcp:22)
In addition to the SSH communication, I need to open the access to the HTTPS port of the SAP Host Agent, i.e. port 1129. Therefore, I create one more firewall rule with the name insthost-allow-sapha:
gcloud compute firewall-rules create insthost-allow-sapha \
  --network datahub-network --source-ranges [SOURCE_IP_RANGES] \
  --target-tags inst-host-vm --allow tcp:1129
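The two rules differ only in their name and port, so they can be generated from one template. The sketch below only prints the commands for review and does not execute them; firewall_cmd is a hypothetical helper of mine, and [SOURCE_IP_RANGES] deliberately stays a placeholder that you must replace.

```shell
# Print the firewall-rule command for one rule name and one TCP port.
# firewall_cmd is a hypothetical helper; nothing is executed against GCP here.
firewall_cmd() {
  echo "gcloud compute firewall-rules create $1" \
    "--network datahub-network --source-ranges [SOURCE_IP_RANGES]" \
    "--target-tags inst-host-vm --allow tcp:$2"
}

firewall_cmd insthost-allow-ssh 22
firewall_cmd insthost-allow-sapha 1129
```

Printing first makes it easy to check the network, tag, and port before anything is created.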
Create a Private Kubernetes Cluster
Graphical View in the GCP Console
First, define the general settings for the new cluster:
Click the button Advanced edit in the node pool box to maintain the settings for the node pool:
In the Security section under the node pool, the “Read Write” authorization for the Storage API is required:
Save the configuration changes of the node pool and continue with the general cluster settings.
Expand the advanced settings and maintain the settings as in the screenshots below:
In the Networking section I define several cluster parameters which are relevant for a private cluster:
- activate the flag Enable VPC-native
- select the network datahub-network which was created earlier
- deselect the flag Automatically create secondary ranges to be able to select the address ranges pre-configured in the datahub-network
- activate the option Private cluster
- deselect the flag Access master using its external IP address. It is the most restrictive option and allows the access of the master only from internal IP addresses. In my case, I am going to access the master only from the installation host.
- define the Master IP Range (e.g. 172.16.0.0/28)
- Additionally, keep the flag Enable HTTP load balancing activated. It is required for the configuration of the Kubernetes Ingress after the SAP Data Hub installation:
In the Security settings, I deselect the flags Enable basic authentication and Issue a client certificate:
The following gcloud command is the equivalent of the UI-based configuration shown above:
gcloud beta container clusters create "datahub-cluster" \
  --project "[GCP_PROJECT_ID]" --zone "europe-west3-a" \
  --no-enable-basic-auth --cluster-version "1.11.7-gke.4" \
  --machine-type "custom-4-32768-ext" --image-type "COS" \
  --disk-type "pd-standard" --disk-size "100" \
  --scopes "https://www.googleapis.com/auth/devstorage.read_write","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --num-nodes "3" --enable-stackdriver-kubernetes \
  --enable-private-nodes --enable-private-endpoint \
  --master-ipv4-cidr "172.16.0.0/28" --enable-ip-alias \
  --network "projects/[GCP_PROJECT_ID]/global/networks/datahub-network" \
  --subnetwork "projects/[GCP_PROJECT_ID]/regions/europe-west3/subnetworks/datahub-subnet-1" \
  --cluster-secondary-range-name "datahub-pod-range" \
  --services-secondary-range-name "datahub-svc-range" \
  --default-max-pods-per-node "110" --enable-master-authorized-networks \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing \
  --no-enable-autoupgrade --enable-autorepair --maintenance-window "11:00"
You can find a more detailed explanation of the options used in the gcloud reference.
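The master IP range must not overlap with the ranges already used in the network. The following sketch checks the example master range against the three subnet ranges defined earlier. The helpers ip_to_int and cidr_overlap are my own names, and the check assumes that every range is written with its aligned network address (as all ranges in this blog are).

```shell
# Convert a dotted IPv4 address to an integer (runs in a subshell so IFS stays local).
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed (exit 0) if the two CIDR blocks share at least one address.
# Assumes both blocks are given with their network address, e.g. 10.0.4.0/22.
cidr_overlap() (
  a_start=$(ip_to_int "${1%/*}"); a_len=${1#*/}
  b_start=$(ip_to_int "${2%/*}"); b_len=${2#*/}
  a_end=$(( a_start + (1 << (32 - a_len)) - 1 ))
  b_end=$(( b_start + (1 << (32 - b_len)) - 1 ))
  [ "$a_start" -le "$b_end" ] && [ "$b_start" -le "$a_end" ]
)

# Compare the master range with the node, Service, and Pod ranges.
master="172.16.0.0/28"
for range in 10.0.4.0/22 10.0.32.0/20 10.4.0.0/14; do
  if cidr_overlap "$master" "$range"; then
    echo "overlap: $master and $range"
  else
    echo "ok: $master and $range"
  fi
done
```

If you pick different ranges for your own landscape, replace the values in the loop before you create the cluster.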
Prepare the Installation Host
Connect to the Installation Host
As mentioned at the beginning, all previous steps were executed either in the GCP Cloud Shell or on the local client with the Google Cloud SDK installed.
For the next steps, I must connect to the installation host via SSH. In general, there are several ways to connect to instances running in GCP. The different approaches are very well described in the GCP documentation.
Personally, I add my public SSH key to the metadata of the installation host and use the PuTTY client to connect to the external IP of this host. But you might choose a different approach.
One part of the preparation activities on the installation host is the installation and configuration of the SAP Host Agent. This part is well described in the SAP Online Documentation, so I refer you to that documentation instead of repeating the whole procedure here.
In addition to the SAP Host Agent, the Installation Guide for SAP Data Hub lists third-party tools that are also required on the installation host. Below, I give condensed instructions for the steps I must execute on the installation host so that the prerequisite checks pass during the installation of SAP Data Hub.
Please note that the SAP Data Hub installer is going to run later in the context of the root user. Therefore, the first step I do after connecting to the installation host is switching to root:
sudo su -
Install the Google Cloud SDK
Follow the GCP documentation to install the Google Cloud SDK on the installation host.
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
Restart the shell and execute the following command to initialize the Google Cloud SDK:
gcloud init
Install and configure Kubernetes command-line tool (kubectl)
There are again several ways to install kubectl. Some of them are described in the Kubernetes documentation.
An additional option is to use the previously installed Google Cloud SDK and install kubectl as part of it. However, you should pay attention to the correct version of kubectl.
I run the following gcloud command to install kubectl:
gcloud components install kubectl
Before the installation is started, you are informed which version is about to be installed. In my case it is 1.11.7.
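Kubernetes supports kubectl within one minor version of the cluster's API server, so comparing the minor versions is a quick sanity check. This is a minimal sketch with helper names I made up (minor_of, skew_ok); it assumes both versions share the same major version and only parses the strings, it does not query the cluster.

```shell
# Extract the minor version from strings such as "1.11.7" or "1.11.7-gke.4".
minor_of() (
  rest=${1#*.}        # drop the major version and the first dot
  echo "${rest%%.*}"  # keep everything up to the next dot
)

# Succeed if the kubectl and cluster minor versions differ by at most one.
# Assumes the major versions are identical.
skew_ok() (
  diff=$(( $(minor_of "$1") - $(minor_of "$2") ))
  [ "$diff" -ge -1 ] && [ "$diff" -le 1 ]
)

if skew_ok "1.11.7" "1.11.7-gke.4"; then
  echo "kubectl version is compatible with the cluster"
fi
```

The two version strings are the ones used in this blog; replace them with the values from your own environment.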
To be able to connect to the previously created Kubernetes cluster, I execute the following command to generate a new entry in the kubeconfig file:
gcloud container clusters get-credentials datahub-cluster \
  --zone europe-west3-a --project [GCP_PROJECT_ID] --internal-ip
I can verify the access to my Kubernetes cluster, for example, by executing the following command:
kubectl get nodes
If everything is configured correctly, I should see the list of the worker nodes.
Install and configure Helm
The Kubernetes package manager Helm needs to be installed and configured on the installation host. You should always check the documentation for the Helm versions required for the SAP Data Hub installation. At the time of writing, 2.9.x was one of the valid options.
I execute the following commands as root user to download the Helm binary and to extract it:
wget https://storage.googleapis.com/kubernetes-helm/helm-v2.9.0-linux-amd64.tar.gz
tar zxfv helm-v2.9.0-linux-amd64.tar.gz
cp linux-amd64/helm /usr/local/bin/helm
Now I configure and initialize Helm with the following commands:
kubectl create serviceaccount tiller --namespace kube-system
kubectl create clusterrolebinding tiller --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account=tiller
After Helm initialization I run a quick check with the following command:
No errors should appear.
You can find more details on the Helm project page.
Install and configure Docker
If Docker CE is not installed on your installation host, follow the Docker or SUSE documentation.
On my installation host, Docker CE 18.06.1 is preinstalled. However, the Docker service is not started.
I start the Docker service with the following command:
systemctl start docker.service
In addition, enable the Docker service to start automatically at boot time:
systemctl enable docker.service
Docker must be able to authenticate to Google Container Registry. With the following command I configure Docker to use gcloud as a Docker credential helper:
gcloud auth configure-docker
This command creates certain entries in the Docker configuration file to enable its authentication to Google Container Registry. You can find more background information in the GCP documentation.
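To confirm that the entry was written, you can look for the gcr.io credential helper in the Docker configuration file. The sketch below assumes the default location $HOME/.docker/config.json (for root, /root/.docker/config.json) and uses a simple text match rather than a JSON parser; has_gcr_helper is a hypothetical helper name.

```shell
# Succeed if the given Docker config file maps gcr.io to the gcloud credential helper.
# has_gcr_helper is a hypothetical helper; it does a plain text match, not JSON parsing.
has_gcr_helper() {
  grep -q '"gcr\.io"[[:space:]]*:[[:space:]]*"gcloud"' "$1"
}

# Assumed default location of the Docker client configuration.
config="$HOME/.docker/config.json"
if [ -f "$config" ] && has_gcr_helper "$config"; then
  echo "gcloud credential helper is configured for gcr.io"
fi
```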
Install the Python YAML Package (PyYAML)
In the last step of my preparation activities, I need to install the Python YAML package on the installation host:
zypper install python-yaml
Installation of SAP Data Hub
Now, the environment is prepared, and the installation of SAP Data Hub can be started. The installation process with Maintenance Planner and SL Plugin is described in Stefan Jakobi’s blog.
Exposing SAP Data Hub
If you follow Stefan’s blog or the installation guide in the SAP Documentation and you successfully complete the installation, you can use kubectl or the GCP Console to verify the status of your Kubernetes cluster and of the deployed workloads. You can see the list of Pods and Services. But this is not how you use the SAP Data Hub.
To be able to access the UI of SAP Data Hub from which you can create and run your pipelines or manage connections to data sources etc., you need to expose it externally. The required steps are specific to each Cloud provider and are described in the SAP Documentation.
I want to add here additional links to the GCP documentation where you can get some more background information:
Over the course of our tests, we could clearly identify the repetitive manual tasks that had to be performed for every new installation of SAP Data Hub. Even though many of those tasks can be accelerated by using command-line tools instead of graphical UIs, you could improve the provisioning process even further by creating the required GCP resources automatically, for example with Google Cloud Deployment Manager or an alternative such as Terraform.
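Short of a full Deployment Manager or Terraform setup, a first step toward repeatable provisioning is to make the gcloud commands safe to re-run. The sketch below shows a create-only-if-missing pattern; ensure is a helper I made up for illustration, and it takes a check command and a create command as strings.

```shell
# Run the create command only if the check command fails (i.e. the resource is missing).
# ensure is a hypothetical helper, not part of the Google Cloud SDK.
ensure() {
  if sh -c "$1" >/dev/null 2>&1; then
    echo "exists, skipping"
  else
    sh -c "$2"
  fi
}

# Example usage with the network from this blog (commented out so nothing is created here):
# ensure "gcloud compute networks describe datahub-network" \
#        "gcloud compute networks create datahub-network --subnet-mode=custom"
```

A script built this way can be started again after a failure without tripping over resources that already exist.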