Data (processing) pipelines and data workflows are at the center of SAP Data Hub, and I would like to dedicate my next few blog posts to them.
You might say: “Hey, wait… why are you talking about data pipelines and data workflows in a blog post about graphs?” I will explain that in a few minutes.
If you would like to learn step-by-step how to build data pipelines and data workflows, then take a look at the documentation, use SAP Data Hub, developer edition / trial edition, or read jens.rannacher’s excellent series of tutorials.
In my opinion, the basics of data pipelines and data workflows are easy to grasp. Really understanding what’s happening “under the hood”, however, requires some Sherlock genes. Let’s start our investigation.
Hello, World!
SAP Data Hub (more precisely, the SAP Data Hub Modeler) allows you to build graphs. These are executed as containers on a Kubernetes cluster. The Modeling Guide for SAP Data Hub says:
I would like to reiterate one aspect: operators – and hence also graphs – are not intended to terminate. Per se, graphs run “forever”. You need to terminate graphs explicitly (either via the SAP Data Hub Modeler or by modeling the termination inside the graph).
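To make this tangible, here is a small plain-JavaScript sketch of that behavior (my own illustration, not the SAP Data Hub API; `makeGenerator`, `tick` and `stop` are made-up names): a generator keeps emitting on every tick until it is stopped explicitly.

```javascript
// Hypothetical stand-in for a generator operator: it emits a constant
// value on every "tick" and only stops when stop() is called explicitly.
function makeGenerator(value, emit) {
    let running = true;
    return {
        tick() { if (running) emit(value); },  // called once per interval
        stop() { running = false; }            // explicit termination
    };
}

const received = [];
const generator = makeGenerator("Hello, World!", (msg) => received.push(msg));

// Simulate a few clock ticks; without the stop() call the generator
// would happily keep emitting "forever".
generator.tick();
generator.tick();
generator.stop();
generator.tick();

console.log(received.length);  // 2 emissions before the explicit stop
```

Nothing inside the generator ever decides to finish on its own; the only way to end it is the explicit `stop()` from the outside, which mirrors terminating a graph via the Modeler.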
Example: Hello, World!
Let’s look at a few example graphs. These graphs are not intended to explain how you ingest or transform data. The sole intention behind them is to illustrate how graphs work.
The following graph I like to call the “Hello, World!” graph (link to GitHub). It consists of a Constant Generator operator and a Terminal operator. They communicate via the out (output) port and the in1 (input) port:
Let’s configure the Constant Generator operator to emit the string “Hello, World!” through the out (output) port every second. The configuration looks like this:
The string “Hello, World!” is delivered to the Terminal operator through the in1 (input) port:
When you run the graph, it goes into status “pending” (being prepared) and then “running”. When you now look at the output of the Terminal operator, you see something like this:
The graph will continue to run “forever”. You have to terminate it explicitly. After terminating it, the graph goes into status “completed” (in case of problems with the graph, the status will be “dead”):
Example: Chat
Next, let’s change our graph a bit (link to GitHub). First, adapt the configuration of the Constant Generator operator to only emit the string “Hello, World!” when data is delivered through the in (input) port:
Second, connect the out1 (output) port of the Terminal operator to the in (input) port of the Constant Generator operator:
When you run the graph and look at the output of the terminal, you initially see nothing. Enter “Hello, There!” and see what happens:
The string “Hello, There!” (1) is sent to the Constant Generator operator, and this operator sends back the string “Hello, World!” (2).
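The feedback loop can be sketched in plain JavaScript (again my own simplified simulation, not the Data Hub API; the two functions below stand in for the two operators, and the port wiring is reduced to a function call):

```javascript
// Simplified simulation of the "Chat" graph: the Terminal's out1 port is
// wired to the Constant Generator's in port, and the generator answers
// every incoming message with its configured constant.
const terminalOutput = [];

// Constant Generator: emits its constant only when data arrives on "in".
function constantGenerator(input, emit) {
    emit("Hello, World!");
}

// Terminal: whatever arrives on "in1" ends up on the screen.
function terminal(input) {
    terminalOutput.push(input);
}

// You type "Hello, There!" into the terminal; its out1 port feeds the
// generator, and the generator's out port feeds the terminal again.
constantGenerator("Hello, There!", terminal);

console.log(terminalOutput);  // [ 'Hello, World!' ]
```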
Example: Chat with Termination
Last, let’s see how we can terminate the graph as soon as you enter “Bye!” in the terminal (link to GitHub). To do so, add a JavaScript operator and a Graph Terminator:
The JavaScript operator has one input port in1 as well as two output ports out1 and out2. All ports are typed as string.
The following code inside the JavaScript operator ensures that whatever you enter in the terminal is sent to the Constant Generator operator, except when you enter “Bye!”, which is sent to the Graph Terminator:
$.setPortCallback("in1", onInput);

function onInput(ctx, s) {
    if (s != "Bye!") {
        // forward everything else to the Constant Generator
        $.out1(s);
    } else {
        // "Bye!" goes to the Graph Terminator
        $.out2(s);
    }
}
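Outside the Modeler, you can try out the routing logic with a plain-JavaScript stand-in for the `$` API (a sketch of mine; the two arrays below simulate the out1 and out2 ports):

```javascript
// Simulated output ports of the JavaScript operator.
const toGenerator = [];   // out1 -> Constant Generator
const toTerminator = [];  // out2 -> Graph Terminator

// Same routing logic as in the operator script above.
function onInput(s) {
    if (s != "Bye!") {
        toGenerator.push(s);   // stands in for $.out1(s)
    } else {
        toTerminator.push(s);  // stands in for $.out2(s)
    }
}

["Hello, There!", "How are you?", "Bye!"].forEach(onInput);

console.log(toGenerator);   // [ 'Hello, There!', 'How are you?' ]
console.log(toTerminator);  // [ 'Bye!' ]
```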
Run the graph and see what happens:
When you enter “Hello, There!” (1), then “Hello, World!” (2) appears. When you enter “Bye!” (3), nothing appears and the graph goes into status “completed”.
Graphs vs. Pipelines vs. Workflows
Now, let me come back to the question: what are the differences between graphs, data pipelines and data workflows?
On a conceptual level and in UML, I like to describe the relationship between the three terms like this:
Graphs can be data pipelines or data workflows. The other way around, data pipelines and data workflows are specializations of graphs:
- Data pipelines are used to process data. Hence, pipeline operators (i.e. operators used in a data pipeline) typically receive and send data (a broad term which, for the sake of this blog, I will not define in detail). Data pipelines are often non-terminating graphs.
- Data workflows are used to orchestrate (data processing) tasks (potentially across systems). They use workflow operators, which receive and send status messages / triggers, and have a defined start and end.
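The specialization relationship can also be sketched as a small class hierarchy (purely my own illustration, not an SAP Data Hub API; the class and method names are made up):

```javascript
// Graphs are the general concept; data pipelines and data workflows
// are specializations that differ in what their operators exchange.
class Graph {
    constructor(operators) { this.operators = operators; }
}

class DataPipeline extends Graph {
    // Pipeline operators exchange data; often non-terminating.
    payloadKind() { return "data"; }
}

class DataWorkflow extends Graph {
    // Workflow operators exchange status messages / triggers
    // and have a defined start and end.
    payloadKind() { return "status message / trigger"; }
}

const pipeline = new DataPipeline(["Data Generator", "SAP HANA Client"]);
const workflow = new DataWorkflow(["Workflow Trigger", "Workflow Terminator"]);

console.log(pipeline instanceof Graph);  // true: a pipeline is a graph
console.log(workflow instanceof Graph);  // true: a workflow is a graph
```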
On a technical level and pragmatically, I tend to say:
- SAP Data Hub itself currently does not accurately separate data pipelines and data workflows. In many places it only “knows” graphs.
- Whether “something” is a data pipeline or a data workflow depends on the modeling (and the operators used inside the graph).
- SAP Data Hub does include dedicated workflow operators which follow certain conventions:
- When (only) workflow operators are used in a graph, it is fair to consider this graph to be a data workflow.
Example: Data Pipeline
The following graph (which is delivered as com.sap.demo.hana with every SAP Data Hub installation) is a data pipeline, because it processes data. And this data “flows through” the graph, i.e. from the output port of the Data Generator via the 1:2 Multiplexer to the data port of the SAP HANA Client:
We can see the data “flowing through” the graph by adding a Wiretap operator (link to GitHub):
Example: Data Workflow
The following graph (which is delivered as com.sap.demo.dataworkflow) is a data workflow with a start (defined by the Workflow Trigger operator) and an end (defined by the Workflow Terminator operator). It “only” orchestrates tasks (one task, to be precise) and the operators used in the graph exchange “only” status messages / triggers:
Again, we can use a Wiretap operator (or two this time) to inspect the exchanged status messages / triggers (link to GitHub):
Let’s extend the example a bit and then take a closer look at what happens “under the hood” when several tasks are orchestrated by the data workflow (link to GitHub; in addition you need this and this graph as well):
- The data workflow consists of three tasks. Two of them are executed sequentially. The third task is executed in parallel to the other two.
- Each task executes a data pipeline (which waits 10 or 30 seconds, respectively, before terminating).
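The scheduling can be sketched with a little arithmetic (a simplified model of mine, not how Data Hub schedules internally; I assume here that the two sequential pipelines wait 10 seconds each and the parallel one 30 seconds):

```javascript
// Simplified model of the extended data workflow: two tasks run
// one after the other, a third runs in parallel to them.
function sequential(durations) {
    // Each task starts when its predecessor finishes.
    return durations.reduce((end, d) => end + d, 0);
}

function parallel(branchEnds) {
    // The workflow finishes when its slowest branch finishes.
    return Math.max(...branchEnds);
}

const branchA = sequential([10, 10]); // two 10s pipelines, back to back
const branchB = sequential([30]);     // one 30s pipeline, in parallel

const workflowEnd = parallel([branchA, branchB]);
console.log(branchA, branchB, workflowEnd);  // 20 30 30
```

In other words: the sequential branch finishes after 20 seconds, but the workflow as a whole only completes once the slower 30-second parallel branch is done.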
Let’s now run the extended data workflow and take a look at the status pane (Show Subgraphs must be enabled). At the end, the status pane will look like this:
The status pane is very insightful with regard to how our data workflow (or workflows in general) is executed:
- In our example, the workflow (in blue) has triggered three “internal” subgraphs.
- Each of the subgraphs (in yellow) has taken care of executing one of the data pipelines specified in the workflow.
- The executed pipelines are also visible in the status pane (in green).
“Pipeflows” and “Worklines”
Please don’t google for “pipeflows” and “worklines”. These do not exist (in the context of SAP Data Hub):
I have made up these terms to ask the question: can pipeline (i.e. non-workflow) operators and workflow operators be combined in one graph?
The answer is simple: yes, but currently it is not recommended (see the Modeling Guide for SAP Data Hub).
From my perspective there are two good reasons for this recommendation:
- As you have seen, data pipeline (i.e. non-workflow) operators exchange data while workflow operators exchange status messages / triggers. Often it does not make sense to mix them.
- By separating data pipelines and data workflows, you clearly distinguish between data processing and task orchestration.
BUT… I have also said that SAP Data Hub itself currently does not accurately separate data pipelines and data workflows. And the exception proves the rule!
Just look at workflow operators like Data Transfer and Data Transform. They clearly do more than orchestration (but they still only exchange status messages / triggers with their “surroundings”). The boundary between pipeline (i.e. non-workflow) and workflow operators is fluid.
Hence if you have a good reason to combine pipeline (i.e. non-workflow) and workflow operators in one graph, then SAP Data Hub will not stop you from doing so. And SAP Data Hub might even improve the “interoperability” between different types of operators in the future.
That’s it for today. Happy new year and all the best for 2019! And happy “data-hubbing” of course…