Subhabrata Dasgupta

Streaming and Messaging Capabilities in SAP Data Intelligence – Part 1

In this series of blog posts, we are going to cover a few interesting concepts about messaging and streaming in general, and the streaming capabilities available in SAP Data Intelligence.

In this first blog post, I would like to share my understanding of streaming and messaging. I will also touch upon the messaging and integration capabilities of Apache Kafka in the SAP Data Intelligence context.

This two-part blog series covers the following topics:

  1. Publisher/Subscriber Pattern
    1. Message Queueing
    2. Message Streaming
  2. Messaging and Streaming capabilities in SAP Data Intelligence
    1. Kafka
    2. Amazon Web Services – SNS (Simple Notification Service) – part 2
    3. Google Cloud Pub/Sub – part 2
    4. NATS – part 2
    5. MQTT – part 2
    6. WAMP – part 2

Before we jump into the streaming and messaging services in SAP Data Intelligence, let’s get a brief overview of SAP DI.

The SAP Data Intelligence platform provides capabilities for managing complex data landscapes, building scalable data pipelines, infusing machine learning into applications, and managing the end-to-end lifecycle of these developments.

SAP Data Intelligence combines the data orchestration capabilities of SAP Data Hub with the innovation capabilities of SAP Leonardo.

Now that we have an overview of SAP Data Intelligence, let us try to understand the Publish-Subscribe pattern and the different ways it can be implemented.

What is the Pub/Sub pattern?

In a Publisher (Sender) – Subscriber (Receiver) model, the publisher does not send the message directly to the subscriber. As a result, the publisher and the subscriber are unaware of each other’s existence.

A “message” can be any type of data that needs to be transferred from a sender to a receiver. A message may be an error event, delta records from transactional systems, or real-time data from IoT devices.

Subject-Observer Pattern vs Publisher-Subscriber Pattern

The publisher publishes its message to an event channel, and the subscriber, once notified of this publish event, reads the message from the event channel.

Different technologies implement this pattern with their own modifications, but the core concept of decoupling the sub-systems and separating concerns remains the same.
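
To make the decoupling concrete, here is a minimal in-memory sketch in Python. The `EventChannel` class, the channel name "errors" and the sample message are all hypothetical and only illustrate the idea; real messaging systems add persistence, networking and delivery guarantees on top of it.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class EventChannel:
    """Hypothetical in-memory event channel (illustration only)."""

    def __init__(self) -> None:
        # channel name -> callbacks registered by subscribers
        self._subscribers: DefaultDict[str, List[Callable[[object], None]]] = defaultdict(list)

    def subscribe(self, channel: str, callback: Callable[[object], None]) -> None:
        self._subscribers[channel].append(callback)

    def publish(self, channel: str, message: object) -> int:
        # The publisher only knows the channel name, never the subscribers.
        callbacks = self._subscribers.get(channel, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)  # number of subscribers that were notified


channel = EventChannel()
errors_seen = []  # stands in for one subscriber
audit_log = []    # stands in for a second, independent subscriber
channel.subscribe("errors", errors_seen.append)
channel.subscribe("errors", audit_log.append)
delivered = channel.publish("errors", {"code": 500, "source": "iot-gateway"})
```

Neither subscriber is referenced by the publishing code, which is the separation of concerns described above.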

Now that we have an overview of the Pub/Sub pattern, let’s explore its two most common implementations: Message Queueing and Message Streaming.

Message Queueing

In a Message Queueing system, as the name suggests, the publisher pushes the message to the message queue (en-queue) and the subscriber pulls the message in FIFO order (de-queue). The main catch in this implementation is that the message is deleted from the queue once it is read by a subscriber, so generally a single message cannot be sent to multiple subscribers: each message is processed only once, by a single consumer.

Message Queueing

A common message queueing service is RabbitMQ, which implements AMQP (Advanced Message Queuing Protocol). Amazon Simple Queue Service (SQS) is also a popular queueing service provided by AWS.
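
The queueing behaviour described above can be sketched in a few lines of Python. This is an in-memory illustration under simplifying assumptions, not how RabbitMQ or SQS work internally; the `MessageQueue` class and the message names are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class MessageQueue:
    """Hypothetical FIFO queue: a message is removed as soon as it is read."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def enqueue(self, message: str) -> None:
        self._messages.append(message)

    def dequeue(self) -> Optional[str]:
        # Reading deletes the message, so no other consumer can see it.
        return self._messages.popleft() if self._messages else None


queue = MessageQueue()
for msg in ("m1", "m2", "m3"):
    queue.enqueue(msg)

# Two consumers compete for messages; each message goes to exactly one of them.
consumer_a = [queue.dequeue()]
consumer_b = [queue.dequeue(), queue.dequeue()]
leftover = queue.dequeue()  # the queue is empty by now
```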

Message Streaming

In a Message Streaming system, there is a streaming broker which acts as a distributed append-only log. Every new message from the publisher is appended to the end of this persistent log.

In contrast to queueing, these logs are persisted, so one message can be sent to one or more consumers. Simply put, any consumer who subscribes to this log can access these messages.

Message Streaming

 

In a Message Streaming service, the subscriber can move within this log and start reading the messages from any point. Apache Kafka is the most common open-source streaming technology. Amazon Kinesis provides a similar service.
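
As a contrast to the queue, here is a small Python sketch of an append-only log in which every consumer keeps its own read position (offset). The `MessageLog` class and the consumer names are hypothetical; a real streaming broker distributes and persists the log, which this in-memory sketch does not attempt.

```python
from typing import Dict, List


class MessageLog:
    """Hypothetical append-only log with one read offset per consumer."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}

    def append(self, message: str) -> int:
        self._records.append(message)
        return len(self._records) - 1  # offset of the new record

    def seek(self, consumer: str, offset: int) -> None:
        # A consumer may move to any point in the log.
        self._offsets[consumer] = offset

    def poll(self, consumer: str) -> List[str]:
        # Reading does not delete anything; only the consumer's offset moves.
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:]
        self._offsets[consumer] = len(self._records)
        return batch


log = MessageLog()
for msg in ("m1", "m2", "m3"):
    log.append(msg)

first_read = log.poll("analytics")   # sees all three messages
second_read = log.poll("billing")    # a second consumer sees them too
log.seek("analytics", 1)             # rewind to offset 1
replayed = log.poll("analytics")     # re-reads from the second message onwards
```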

Messaging and Streaming capabilities in SAP Data Intelligence

Kafka

As we know, Apache Kafka is a distributed streaming platform. Kafka runs as a cluster on one or more servers that can span multiple data centers. The Kafka cluster stores streams of records in categories called topics. Each record consists of a key, a value, and a timestamp.

Kafka Architecture

 

At a high level, Kafka provides the following guarantees:

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent.
  • A consumer instance sees records in the order they are stored in the log.
  • For a topic with replication factor N, Kafka will tolerate up to (N-1) server failures without losing any records committed to the log.
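
The replication guarantee can be checked with a short Python sketch. The replica placement below (consecutive brokers) is an assumption made for illustration and is not Kafka's actual assignment algorithm; the broker count of 4 and replication factor of 3 are arbitrary example values.

```python
from itertools import combinations
from typing import Set


def replica_brokers(partition: int, replication_factor: int, broker_count: int) -> Set[int]:
    """Place replicas on consecutive brokers (illustrative placement only)."""
    return {(partition + i) % broker_count for i in range(replication_factor)}


def survives(replicas: Set[int], failed: Set[int]) -> bool:
    # Committed records survive while at least one replica broker is alive.
    return bool(replicas - failed)


BROKERS = 4
N = 3  # replication factor
replicas = replica_brokers(partition=0, replication_factor=N, broker_count=BROKERS)

# Every possible combination of N-1 failed brokers leaves a live replica.
tolerates_n_minus_1 = all(
    survives(replicas, set(failed)) for failed in combinations(range(BROKERS), N - 1)
)
# With N failures, at least one combination takes down every replica.
can_lose_data_with_n = any(
    not survives(replicas, set(failed)) for failed in combinations(range(BROKERS), N)
)
```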

The following are some key concepts and terms in Kafka:

  • Producer – A producer is a client (an application or a system) that wants to send some data. In Kafka, the producer is connected and configured using the Kafka producer APIs.
  • Consumer – A consumer is a client (an application or a system) that wants to read the messages available in a particular topic.
  • Kafka cluster – A Kafka broker is the intermediary between a producer and a consumer. To make Kafka more fault tolerant, multiple Kafka brokers, which can span multiple data centers, are used together as a Kafka cluster.
  • Topics – Topics are a logical isolation of messages in Kafka. A producer or a consumer points to a particular topic in a Kafka cluster for posting or reading a message. Messages in a topic are stored in partitioned logs. We can set the replication factor for a topic to cope with server failures. A topic with a replication factor of N can tolerate up to N-1 server failures.
  • Partitions – Within a partition of a topic, the records are ordered, and any new record posted by the producer is appended to the partition. A topic can have multiple partitions.
  • Consumer groups – A consumer group is a set of consumers with a specific task. Each partition of a topic is read by only one consumer in a consumer group, so the number of active consumers in a consumer group cannot be more than the number of partitions.
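
The relationship between keys, partitions and consumer groups can be sketched in plain Python. Both functions are hypothetical helpers: `partition_for` uses CRC32 as a stand-in hash (Kafka's default Java partitioner uses murmur2), and `assign_partitions` uses a simple round-robin rule rather than any of Kafka's real assignment strategies.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (assumption): records with the same key always land in
    # the same partition, which keeps them in order.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    # Round-robin sketch: every partition gets exactly one consumer in the group.
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


small_group = assign_partitions(3, ["c1", "c2"])
large_group = assign_partitions(3, ["c1", "c2", "c3", "c4"])
# Consumers beyond the partition count receive nothing and stay idle.
idle = [c for c, parts in large_group.items() if not parts]
```

With three partitions and four consumers, one consumer stays idle, which is why the number of active consumers in a group is capped by the partition count.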

The major use cases of Kafka can be broadly grouped into three: as a messaging system, as a storage layer, and as a stream processing service.

Kafka provides different APIs for these three use cases. In SAP Data Intelligence, SAP has delivered Kafka Producer and Kafka Consumer operators. If we have a Kafka server in place, we can connect to the broker using Connection Management (SAP has delivered a connection of type Kafka). Once this connection is established, we can use a producer or consumer operator in the SAP Data Intelligence Modeler to orchestrate our data flow.

Pros and Cons of Kafka

Pros

  • Low latency: Kafka decouples producers from consumers, so a consumer can read a message at any time.
  • Fault tolerance: Kafka is resistant to node or machine failures within the cluster.
  • Durability: Kafka offers replication, which persists data or messages on disk across the cluster.
  • Easily accessible: As all our data gets stored in Kafka, it becomes easily accessible to any consumer.
  • Scalability: Kafka can handle a large number of messages simultaneously, which makes it a scalable software product.

Cons

  • Unavailability of monitoring tools: Apache Kafka does not contain a complete set of monitoring and management tools.
  • Message tweaking issues: If a message needs some tweaking, the performance of Kafka is significantly reduced.
  • Wildcard topic selection: Apache Kafka offers only limited wildcard topic selection; a producer, for example, must always name the exact topic.
  • Lacks some message paradigms: Certain message paradigms, such as point-to-point queues and request/reply, are missing in Kafka.
  • Clumsy behaviour: Apache Kafka often behaves a bit clumsily when the number of queues in the Kafka cluster increases.

In order to use Kafka in SAP DI, we first need to create a connection to our Kafka cluster.

Kafka connection in connection management

SAP delivers an example graph for a simple Kafka flow.

Kafka Graph Example

We can configure the Kafka producer and Kafka consumer operators according to our needs. The following options are available for configuration in the Kafka producer and consumer:

Kafka Producer Configurations

Kafka Consumer Configurations

Conclusion

This blog post is written with the intention of providing a broader outlook on streaming and messaging. I have tried to share my understanding of Kafka and how we can use it in SAP Data Intelligence. Please leave a comment below with any questions, feedback, or anything you want to add. I hope this is helpful in starting your SAP Data Intelligence journey.

      6 Comments
      Khavya Seshadri

      Hello Subhabrata,

       

      Do you know if there is a way to test the Kafka connection in DIC?

      I set up a Kafka connection and tried out a simple graph that consumes data from a Kafka topic. When run, the graph stays in the pending state for a long time and then goes dead with the following logs:

      -graph-runtime|dist|55|syncBarrier|node_graph.go(258)
      2022-05-12 02:10:36.009473|+0000|ERROR|Error during init: error building graph: error during init of process: component=com.sap.kafka.consumer2 process=kafkaconsumer1: kafka: client has run out of available brokers to talk to (Is your cluster reachable?) {graph_id="2d7dd1dfa7504262bed9850d206463f5",group="default"}|vflow-graph-runtime|dist|55|init|node_graph.go(182)
      2022-05-12 02:10:36.009494|+0000|ERROR|Error creating node graph, err: group initialization is failed: group=default message=error building graph: error during init of process: component=com.sap.kafka.consumer2 process=kafkaconsumer1: kafka: client has run out of available brokers to talk to (Is your cluster reachable?) {graph_id="2d7dd1dfa7504262bed9850d206463f5",group="default"}|vflow-graph-runtime|dist|55|ExecuteGroup|node_runtime.go(662)
      2022-05-12 02:10:36.009532|+0000|INFO |Stopping metrics monitor manager|vflow-graph-runtime|metrics|1|Stop|monitor_manager.go(167)
      2022-05-12 02:10:36.009543|+0000|INFO |Metrics monitor manager stopped|vflow-graph-runtime|metrics|1|Stop|monitor_manager.go(208)
      2022-05-12 02:10:36.010718|+0000|FATAL|Graph execution terminated abnormally: group initialization is failed: group=default message=error building graph: error during init of process: component=com.sap.kafka.consumer2 process=kafkaconsumer1: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)|vflow-graph-runtime|vflow-graph-runtime|1|runNode|main.go(277)

      It would be helpful if you could guide us here.

       

      Best,

      Khavya

       

      Sharath Prakash

      Khavya Seshadri, what solution did you follow to overcome the problem you faced? What was the root cause of the error?

      Sharath Prakash

      Khavya Seshadri, what was the root cause of the error?

      Rajesh PS

      Subhabrata Dasgupta Khavya Seshadri

      I am facing a very similar issue too. Did you solve it?
      Failed to initialize 'com.sap.kafka.consumer2': kafka: client has run out of available brokers to talk to (Is your cluster reachable?) {graph_id="ff8890e576774aec87afcb7ba4ebf9ec",group="default",operator_id="kafkaconsumer1"}|vflow-graph-runtime|com.sap.kafka.consumer2]
      Sharath Prakash

      Rajesh PS, what solution did you apply to overcome the issue?

      Satyajit Samal

      Hi Subhabrata, could you please share the link to part 2, where you cover the NATS producer?