SAP Data Hub – A Study in Graphs

Data (processing) pipelines and data workflows are at the center of SAP Data Hub, and I would like to dedicate my next few blog posts to them.

You might say: “Hey, wait… why are you talking about data pipelines and data workflows in a blog post about graphs?”. I will explain that in a few minutes.

If you would like to learn step by step how to build data pipelines and data workflows, take a look at the documentation, use SAP Data Hub, developer edition / trial edition, or read Jens Rannacher’s excellent series of tutorials.

In my opinion, the basics of data pipelines and data workflows are easy to grasp. Really understanding what’s happening “under the hood”, however, takes some Sherlock genes. Let’s start our investigation.

Hello, World!

SAP Data Hub (more precisely speaking, the SAP Data Hub Modeler) allows you to build graphs. These are executed as container(s) on a Kubernetes cluster. The Modeling Guide for SAP Data Hub says:

A graph is a network of operators connected to each other using typed input ports and output ports for data transfer.

An operator represents a vertex of a graph. An operator is a reactive component, hence is not intended to terminate, and reacts only to events from the environment.

An event from the environment is a message delivered to the operator through its input ports. The operator can interact with the environment through its output ports. The operators are unaware of the graph in which it is defined and the source and target of its incoming and outgoing connections.

I would like to reiterate one aspect: Operators – and hence also graphs – are not intended to terminate. By default, graphs run “forever”. You need to terminate them explicitly (either via SAP Data Hub Modeler or by modeling the termination inside the graph).
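To make the quoted definition a bit more tangible, here is a minimal sketch of operators that react to messages and are wired together through named ports. It is a plain JavaScript model that runs in Node.js; the `Operator` class and every name in it are made up for illustration and have nothing to do with the actual SAP Data Hub runtime or its API.

```javascript
// Illustrative in-memory model only; NOT the SAP Data Hub runtime or its API.
// The class and all names below are hypothetical.
class Operator {
  constructor(name, onMessage) {
    this.name = name;
    this.onMessage = onMessage; // reaction to a message on an input port
    this.wires = {};            // output port name -> connected input ports
  }

  // Connect one of this operator's output ports to another operator's input port.
  connect(outPort, target, inPort) {
    if (!this.wires[outPort]) this.wires[outPort] = [];
    this.wires[outPort].push({ target, inPort });
  }

  // Send a message through an output port. The operator only knows its own
  // port name, not which operators (if any) are connected to it.
  send(outPort, message) {
    for (const { target, inPort } of this.wires[outPort] || []) {
      target.onMessage(inPort, message);
    }
  }
}

const received = [];
const source = new Operator("source", () => {});
const sink = new Operator("sink", (inPort, message) => {
  received.push(`${inPort}: ${message}`);
});
source.connect("out", sink, "in1");

// Simulate three events from the environment.
for (let i = 0; i < 3; i++) {
  source.send("out", "Hello, World!");
}
console.log(received);
```

Note that neither operator holds a reference to "the graph". The wiring lives outside the reaction logic, which mirrors the statement that operators are unaware of the graph in which they are used.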

Example: Hello, World!

Let’s look at a few example graphs. These graphs are not intended to explain how you ingest or transform data. The sole intention behind them is to illustrate how graphs work.

The following I like to call the “Hello, World!” graph (link to GitHub). This “Hello, World!” graph consists of a Constant Generator operator and a Terminal operator. The out (output) port of the Constant Generator is connected to the in1 (input) port of the Terminal:

Let’s configure the Constant Generator operator to emit the string “Hello, World!” through the out (output) port every second. The configuration looks like this:

The string “Hello, World!” is delivered to the Terminal operator through the in1 (input) port:

When you run the graph, it goes into status “pending” (being prepared) and then “running”. When you now look at the output of the Terminal operator, you see something like this:

The graph will continue to run “forever”. You have to explicitly terminate it. After terminating it, the graph goes into status “completed” (in case of problems with the graph the status will be “dead”):

Example: Chat

Next, let’s change our graph a bit (link to GitHub). First, adapt the configuration of the Constant Generator operator to only emit the string “Hello, World!” when data is delivered through the in (input) port:

Second, connect the out1 (output) port of the Terminal operator to the in (input) port of the Constant Generator operator:

When you run the graph and look at the output of the terminal, you initially see nothing. Enter “Hello, There!” and see what happens:

The string “Hello, There!” (1) is sent to the Constant Generator operator. And this operator sends (back) the string “Hello, World!” (2).

Example: Chat with Termination

Last, let’s see how we can terminate the graph as soon as you enter “Bye!” in the terminal (link to GitHub). To do so, add a JavaScript operator and a Graph Terminator:

The JavaScript operator has one input port in1 as well as two output ports out1 and out2. All ports are typed as string.

The following code inside the JavaScript operator ensures that whatever you enter in the terminal is sent to the Constant Generator operator, except when you enter “Bye!”. “Bye!” is sent to the Graph Terminator:

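// Call onInput for every message that arrives at input port in1
$.setPortCallback("in1", onInput);

function onInput(ctx, s) {
  if (s !== "Bye!") {
    // Anything else goes out through out1 (to the Constant Generator)
    $.out1(s);
  } else {
    // "Bye!" goes out through out2 (to the Graph Terminator)
    $.out2(s);
  }
}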
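If you want to try the routing logic outside of SAP Data Hub, a small stand-in for the `$` object is enough. The sketch below runs in plain Node.js. The mock is purely an assumption for illustration: it imitates only the three members used above (`setPortCallback`, `out1`, `out2`) and says nothing about how the real JavaScript operator implements them.

```javascript
// Hypothetical stand-in for the `$` object of the JavaScript operator.
// It only records what the operator code sends to each output port.
const sent = { out1: [], out2: [] };
let registeredCallback = null;

const $ = {
  setPortCallback: (port, fn) => { registeredCallback = fn; },
  out1: (s) => sent.out1.push(s),
  out2: (s) => sent.out2.push(s),
};

// Operator code as shown in the post.
$.setPortCallback("in1", onInput);

function onInput(ctx, s) {
  if (s != "Bye!") {
    $.out1(s);
  } else {
    $.out2(s);
  }
}

// Simulate two lines typed into the terminal.
registeredCallback(null, "Hello, There!");
registeredCallback(null, "Bye!");

console.log(sent);
```

In the mock, everything recorded under `out1` would travel to the Constant Generator and everything under `out2` to the Graph Terminator.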

Run the graph and see what happens:

When you enter “Hello, There!” (1), then “Hello, World!” (2) appears. When you enter “Bye!” (3), nothing appears and the graph goes into status “completed”.

Graphs vs. Pipelines vs. Workflows

Now, let me come back to the question of how graphs, data pipelines and data workflows differ.

On a conceptual level and in UML, I like to describe the relationship between the three terms like this:

Graphs can be data pipelines or data workflows. The other way around, data pipelines and data workflows are specializations of graphs:

  • Data pipelines are used to process data. Hence pipeline operators (i.e. operators used in a data pipeline) typically receive and send data (that’s a broad term and, for the sake of this blog, I will not define it in detail). Data pipelines are often non-terminating graphs.
  • Data workflows are used to orchestrate (data processing) tasks (potentially across systems). They use workflow operators, which receive and send status messages / triggers, and have a defined start and end.

On a technical level and pragmatically, I tend to say:

  • SAP Data Hub itself currently does not strictly separate data pipelines and data workflows. In many places it only “knows” graphs.
  • Whether “something” is a data pipeline or a data workflow depends on the modeling (and the operators used inside the graph).
  • SAP Data Hub does include dedicated workflow operators which follow certain conventions:
  • When (only) workflow operators are used in a graph, it is fair to consider this graph to be a data workflow.
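The pragmatic rule from the list above can be written down as a tiny classifier. This is only a sketch of my rule of thumb, not something SAP Data Hub does internally: the function name `classifyGraph`, the labels it returns and the operator list are assumptions, and the list contains only the workflow operators named in this post rather than a complete catalog.

```javascript
// Hypothetical list for illustration; it only contains the workflow operators
// named in this post and is not a complete catalog.
const WORKFLOW_OPERATORS = new Set([
  "Workflow Trigger",
  "Workflow Terminator",
  "Data Transfer",
  "Data Transform",
]);

// Classify a graph by the kinds of operators it uses (rule of thumb only).
function classifyGraph(operatorNames) {
  if (operatorNames.length === 0) return "empty graph";
  const workflowCount = operatorNames.filter((name) =>
    WORKFLOW_OPERATORS.has(name)
  ).length;
  if (workflowCount === operatorNames.length) return "data workflow";
  if (workflowCount === 0) return "data pipeline";
  return "mixed graph";
}

console.log(classifyGraph(["Workflow Trigger", "Data Transfer", "Workflow Terminator"]));
console.log(classifyGraph(["Data Generator", "1:2 Multiplexer", "SAP HANA Client"]));
```

The point of the sketch is that the classification is derived from the modeling alone; the graph itself carries no "pipeline" or "workflow" flag in this model.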

Example: Data Pipeline

The following graph (which is delivered as com.sap.demo.hana with every SAP Data Hub installation) is a data pipeline, because it processes data. And this data “flows through” the graph, i.e. from the output port of the Data Generator via the 1:2 Multiplexer to the data port of the SAP HANA Client:

We can see the data “flowing through” the graph by adding a Wiretap operator (link to GitHub):

Example: Data Workflow

The following graph (which is delivered as com.sap.demo.dataworkflow) is a data workflow with start (defined by the Workflow Trigger operator) and end (defined by the Workflow Terminator operator). It “only” orchestrates tasks (one task, to be precise) and the operators used in the graph exchange “only” status messages / triggers:

Again, we can use a Wiretap operator (or two this time) to inspect the exchanged status messages / triggers (link to GitHub):

Let’s extend the example a bit and then take a closer look at what happens “under the hood” when several tasks are orchestrated by the data workflow (link to GitHub; in addition you need this and this graph as well):

  • The data workflow consists of three tasks. Two of them are executed sequentially. The third task is executed in parallel to the other two.
  • Each task executes a data pipeline (which waits 10 seconds or 30 seconds, respectively, before being terminated).
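The ordering described above (two tasks in sequence, a third one in parallel) can be sketched with promises in plain Node.js. This only illustrates the control flow; it is not how SAP Data Hub executes workflows. The names `runPipeline` and `runWorkflow` are hypothetical, the waits are scaled down from seconds to milliseconds, and which task waits 10 and which waits 30 is my assumption.

```javascript
// Stand-in for "run a data pipeline that is terminated after `ms` milliseconds".
function runPipeline(name, ms, log) {
  log.push(`start ${name}`);
  return new Promise((resolve) => {
    setTimeout(() => {
      log.push(`end ${name}`);
      resolve(name);
    }, ms);
  });
}

async function runWorkflow(log) {
  // Branch 1: task1 and task2 run one after the other.
  const sequentialBranch = (async () => {
    await runPipeline("task1", 10, log);
    await runPipeline("task2", 30, log);
  })();

  // Branch 2: task3 runs in parallel to branch 1.
  const parallelBranch = runPipeline("task3", 30, log);

  // The workflow ends only after both branches have finished.
  await Promise.all([sequentialBranch, parallelBranch]);
  log.push("workflow completed");
  return log;
}

const workflowLog = [];
const workflowDone = runWorkflow(workflowLog);
workflowDone.then((log) => console.log(log.join("\n")));
```

The log shows task1 and task3 starting right away, while task2 starts only after task1 has ended.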

Let’s now run the extended data workflow and take a look at the status pane (Show Subgraphs must be enabled). At the end, the status pane will look like this:

The status pane is very insightful with regard to how our data workflow (or workflows in general) is executed:

  • In our example, the workflow (in blue) has triggered three “internal” subgraphs.
  • Each of the subgraphs (in yellow) has taken care of executing one of the data pipelines specified in the workflow.
  • The executed pipelines are also visible in the status pane (in green).

“Pipeflows” and “Worklines”

Please don’t google for “pipeflows” and “worklines”. These do not exist (in the context of SAP Data Hub):

I have made up these terms to ask the question: Can pipeline (i.e. non-workflow) operators and workflow operators be combined in one graph?

The answer is simple: yes, but currently it is not recommended (see Modeling Guide for SAP Data Hub).

From my perspective there are two good reasons for this recommendation:

  • As you have seen, data pipeline (i.e. non-workflow) operators exchange data while workflow operators exchange status messages / triggers. Often it does not make sense to mix them.
  • By separating data pipelines and data workflows, you clearly distinguish between data processing and task orchestration.

BUT… I have also said that SAP Data Hub itself currently does not strictly separate data pipelines and data workflows. And the exception proves the rule!

Also look at workflow operators like Data Transfer and Data Transform. They clearly do more than orchestration (but they still only exchange status messages / triggers with their “surroundings”). The boundary between pipeline (i.e. non-workflow) and workflow operators is fluid.

Hence if you have a good reason to combine pipeline (i.e. non-workflow) and workflow operators in one graph, then SAP Data Hub will not stop you from doing so. And SAP Data Hub might even improve the “interoperability” between different types of operators in the future.

That’s it for today. Happy new year and all the best for 2019! And happy “data-hubbing” of course…