Thorsten Schneider

SAP Data Hub – a containerized application

We often say that SAP Data Hub is a containerized application. That sounds beneficial and cool. But what does it really mean? In this blog post I try to explain it.

To get things straight right from the beginning: this blog post is not intended to be an introduction to container technology. I assume that you have at least a basic understanding of containers (including Docker and Kubernetes).

Nevertheless, I will do a very brief – and very much simplified – recap of container technology. If the following is all Greek to you (or, as the Germans say, “I only understand train station”), then you should probably google Docker and Kubernetes first. Afterwards you can come back here.

Containers, Docker, Kubernetes

A container is a “standardized unit of software” (source: Docker). It is based on a template (a container image) and runs on top of a container runtime (daemon). Container images can be stored and made available via registries (Docker Hub being one of them). This is a simplified diagram of a computer / virtual machine running containers:
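
If you want a quick hands-on feel for these terms, the following is a minimal sketch (nginx is just an arbitrary public image from Docker Hub, used purely for illustration and not related to SAP Data Hub):

    # Pull a container image from a registry (here: Docker Hub).
    docker pull nginx:1.25

    # Start a container based on that image on the local container runtime.
    docker run --rm -d --name my-web -p 8080:80 nginx:1.25

    # List the running containers, then stop (and thereby remove) the container.
    docker ps
    docker stop my-web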

Kubernetes is software that helps you manage (many) containers. It is typically, but not necessarily, installed on a computer cluster consisting of multiple physical computers or virtual machines. One of these manages the overall cluster (the master), while the others run the actual workloads (the nodes). Again, let’s use a simplified diagram to depict a Kubernetes cluster:

If you are interested in learning more about the Kubernetes components installed on the master and on each of the nodes, refer to the Kubernetes documentation.
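
If you have kubectl access to such a cluster, you can also list the machines it consists of directly (the output naturally depends on your cluster):

    # List all machines of the Kubernetes cluster, including their roles.
    kubectl get nodes -o wide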

To help you with the management of your containers, Kubernetes allows you to describe the desired state of the cluster. Kubernetes continuously compares the current state with the desired state and makes the necessary adjustments whenever the two deviate, with the goal of bringing the current state in line with the desired state.

You describe the desired state through a set of objects, which are typically defined in .yaml or .json files. Some important objects are:

  • Pods: A pod is the smallest deployable unit in Kubernetes. It can consist of one or multiple containers.
  • Services: A service is used to expose one or multiple pods inside or outside the cluster. It distributes requests to the “underlying” pod(s).
  • Replica Sets: If you want to run multiple replicas of a pod, you can use a replica set. The replica set ensures that the desired number of replicas is running at any time.
  • Deployments: A deployment can manage pods and replica sets. It helps you to easily roll out changes (and, if necessary, roll them back).
  • Stateful Sets: A stateful set is similar to a deployment. It is tailored for stateful applications (e.g. databases).
  • Daemon Sets: To ensure that a copy of a pod runs on every node of a Kubernetes cluster, you can define a daemon set.
  • Persistent Volumes / Persistent Volume Claims: Persistent volumes and persistent volume claims are the means by which pods store data on disk.

For a complete list of all available objects, you can take a look at the Kubernetes documentation.
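
To make the idea of “describing the desired state” a bit more tangible, here is a minimal sketch using kubectl (the name my-app and the image nginx:1.25 are made up purely for illustration and have nothing to do with SAP Data Hub):

    # Create a deployment with two replicas and expose it through a service.
    kubectl create deployment my-app --image=nginx:1.25 --replicas=2
    kubectl expose deployment my-app --port=80

    # Kubernetes now continuously reconciles the current state with this
    # desired state: the deployment creates a replica set, which in turn
    # creates (and keeps) two pods.
    kubectl get deployment,replicaset,pod -l app=my-app
    kubectl get service my-app

    # The description of each object can be viewed (or stored) as .yaml.
    kubectl get deployment my-app -o yaml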

A containerized application

SAP Data Hub 2.3 consists of the SAP Data Hub Foundation and the SAP Data Hub Spark Extensions. The SAP Data Hub Foundation is containerized, i.e. each component of the SAP Data Hub Foundation runs as one or multiple containers on a Kubernetes cluster.

The SAP Data Hub Spark Extensions are not containerized. They are optional and (if used) installed on a Hadoop cluster. I will not consider them in this blog post.

The following diagram visualizes the architecture of SAP Data Hub Foundation and its most important components (it largely follows the SAP Data Hub documentation):

SAP Data Hub makes use of all aforementioned Kubernetes objects (pods, services, deployments…).

Taking a look at the Kubernetes dashboard

To make things tangible, let’s use the Kubernetes Dashboard, a simple web-based user interface, to take a look at a Kubernetes cluster with SAP Data Hub installed (if you want to follow along with my explanations, you can use SAP Data Hub, trial edition to quickly spin up a system).

The following screenshot shows the list of pods running for SAP Data Hub (all pods run in the datahub namespace; namespaces can be used “to divide cluster resources between multiple users”, source: Kubernetes):
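
By the way, the dashboard is not strictly required for this. If you have kubectl access to the cluster, you get the same information from the command line (a sketch, assuming the datahub namespace from above):

    # The kubectl equivalent of the dashboard's pod list.
    kubectl get pods -n datahub

    # SAP Data Hub also creates the other object kinds mentioned earlier.
    kubectl get deployments,statefulsets,daemonsets,replicasets,services,pvc -n datahub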

Let’s pick one component of SAP Data Hub and analyze how it is represented in Kubernetes: SAP Data Hub Connection Management. Filter for a pod with connection (1) in the name. You will see at least one pod (in the unlikely case that there is no pod, first start Connection Management via the SAP Data Hub user interface):

Click on the hyperlink (2). This opens a page with the details of the pod (I have cut several parts that are not important for my explanations out of the following screenshot):

At the beginning of the page you can see the node (3) of the Kubernetes cluster where the pod is running. You can also see that the pod consists of one container, and that this container is based on the container image app-base:2.3.99 (4).

You can also see that the container was created by a replica set (5). As explained before, a replica set ensures that the desired number of (pod) replicas is running at any time.

And finally, you can see the persistent volume claims (6) that the pod has requested (in our example, the pod has requested two persistent volume claims).
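
If you prefer the command line, the same details are visible via kubectl describe (the exact pod name is generated, so substitute whatever your pod list shows):

    # Find the Connection Management pod ...
    kubectl get pods -n datahub | grep connection

    # ... and inspect its details: node, container image, owning replica set,
    # and the persistent volume claims it uses.
    kubectl describe pod <connection-pod-name> -n datahub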

Now, let’s have some fun and delete the pod by pressing the Delete (7) button. What do you expect to happen?

Right! The replica set will start a new pod for Connection Management. You can see this pod in the list of pods running for SAP Data Hub. Pay attention to the icon in front of the pod name; it indicates whether the pod is ready (or not):
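
The command-line version of this little experiment looks roughly like this (again, substitute the generated pod name):

    # Delete the Connection Management pod ...
    kubectl delete pod <connection-pod-name> -n datahub

    # ... and watch the replica set immediately start a replacement
    # (press Ctrl+C to stop watching).
    kubectl get pods -n datahub --watch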

Containers starting containers starting containers

So far so good. The interesting thing is that containers running (parts of) SAP Data Hub can themselves start new containers. And these new containers can, in turn, start new containers. The following table illustrates this:

You build and run a pipeline. SAP Data Hub Modeler starts a container (pod) for the pipeline.
You start the modeling environment. SAP Data Hub System Management starts a container (pod) for the Modeler.
You install the system. The container (or, more precisely, the pod) for SAP Data Hub System Management is started.
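
A simple way to observe this chain (again assuming kubectl access to the cluster) is to sort the pods by creation time while you work in the SAP Data Hub user interface; pods started by other pods show up at the bottom of the list:

    # List pods sorted by creation time; newly started pods (for example the
    # Modeler or a pipeline you just ran) appear last.
    kubectl get pods -n datahub --sort-by=.metadata.creationTimestamp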

Maybe that is a topic I will examine in more detail in one of my next blog posts. There will be much to explain. For today, I hope that you have enjoyed this blog post and understand what it means that SAP Data Hub is containerized.

6 Comments

      Bartosz Jarkowski

      Great introduction into SAP Data Hub and containers! Thanks a lot!

       

      Nabheet Madan

      Great stuff Thorsten Schneider, looking forward to other blogs in the series. #ContainerIsTheWayToBe

      John Graham

      Great blog. Question. You wrote:

      "The SAP Data Hub Spark Extensions are not containerized. They are optional and (if used) installed on a Hadoop cluster"

       

      In Swapan's also great blog he wrote regarding the Hadoop cluster: "Starting with this release, all necessary components including SAP HANA and SAP Vora’s distributed runtime engines are delivered containerized via a Docker registry. This removes the need to install ... a Hadoop cluster for Vora’s runtime executions."

       

      Can you explain the difference?

      https://blogs.saphana.com/2018/10/02/introducing-sap-data-hub-2-3/

      Thorsten Schneider
      Blog Post Author

      Hi John,

      Both (Swapan's blog and mine) are true. Let me try to clarify.

      a) With release 2.3 Hadoop is an optional component for SAP Data Hub. If you have Hadoop, you can use it / connect it. If not, there is no need to run a Hadoop cluster. Having said that, we basically treat HDFS the same as GCS, S3 etc.

      b) There is one - as said - optional component: the Spark Extensions. You can use them, simply put, to connect from a Spark environment to SAP Data Hub. The Spark Extensions need to be installed on the Hadoop cluster. You do NOT need the Spark Extensions to read/write to HDFS. That works through the Foundation (running on Kubernetes).

      I hope that clarifies it.

      Cheers

      Thorsten

      John Graham

      Thank you for the clarification! And also my mistake: Mark Hartz wrote that blog; Swapan wrote another useful blog that linked to it.

      Narsimha Kantipudi

      Thank you Thorsten for this interesting blog to understand containers and Kubernetes.