Streaming and Messaging Capabilities in SAP Data Intelligence – Part 1
In this series of blog posts, we are going to cover a few interesting concepts about messaging and streaming in general, and the streaming capabilities available in SAP Data Intelligence.
In this first blog post, I would like to share my understanding of streaming and messaging. I will also touch upon the messaging and integration capabilities of Apache Kafka in the SAP Data Intelligence context.
This two-part blog series is going to cover the following topics:
- Publisher/Subscriber Pattern
- Message Queueing
- Message Streaming
- Messaging and Streaming capabilities in SAP Data Intelligence
- Amazon Web Service – SNS (Simple Notification Service) – part 2
- Google Cloud Pub/Sub – part 2
- NATS – part 2
- MQTT – part 2
- WAMP – part 2
Before we jump into the streaming and messaging services in SAP Data Intelligence, let’s get a brief overview of SAP DI.
The SAP Data Intelligence platform provides capabilities for managing complex data landscapes, building scalable data pipelines, infusing machine learning into applications, and managing the end-to-end lifecycle of these developments.
SAP Data Intelligence combines the data orchestration capabilities of SAP Data Hub with the innovation capabilities of SAP Leonardo.
Now that we have an overview of SAP Data Intelligence, let us try to understand the Publish-Subscribe pattern and the different ways it can be implemented.
What is Pub/Sub pattern?
In the Publisher (Sender) – Subscriber (Receiver) model, the publisher does not send messages directly to the subscriber; messages flow through an intermediary instead, so the publisher and subscriber are unaware of each other’s existence.
The term “message” can refer to any data that needs to be transferred from a sender to a receiver. A message may be an error event, delta records from a transactional system, or real-time data from IoT devices.
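The decoupling described above can be illustrated with a minimal pure-Python sketch (illustrative only, not a real messaging product): publishers and subscribers both talk to a hypothetical broker object and never reference each other directly.

```python
# Minimal pub/sub sketch (illustrative): the publisher hands the message to
# the broker, and the broker fans it out to every registered subscriber.
class Broker:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        # a subscriber registers interest with the broker, not the publisher
        self.subscribers.append(callback)

    def publish(self, message):
        # the publisher only knows the broker; it is unaware of subscribers
        for callback in self.subscribers:
            callback(message)

broker = Broker()
inbox_a, inbox_b = [], []
broker.subscribe(inbox_a.append)
broker.subscribe(inbox_b.append)

broker.publish("temperature: 21.5")
print(inbox_a, inbox_b)  # both subscribers receive the same message
```

Note that neither the publisher nor the subscribers hold a reference to each other; removing one side never breaks the other.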
Now that we have some overview of the Pub/Sub pattern, let’s explore the two most common implementations of it – Message Queueing and Message Streaming.
In a Message Queueing system, as the name suggests, the publisher pushes messages to a message queue (en-queue) and the subscriber pulls messages in FIFO order (de-queue). The main catch in this implementation is that a message is deleted once it is read by the subscriber, so a single message generally cannot be sent to multiple subscribers: each message is processed only once, by a single consumer.
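This consume-once behaviour can be sketched with Python’s standard-library `queue.Queue` (a minimal illustration, not a real message broker):

```python
# Minimal sketch of message queueing: a publisher en-queues messages, a
# subscriber de-queues them in FIFO order, and each message is removed
# once read -- so it reaches exactly one consumer.
from queue import Queue

broker_queue = Queue()

# Publisher side: en-queue messages.
for msg in ["order-created", "order-paid", "order-shipped"]:
    broker_queue.put(msg)

# Subscriber side: de-queue in FIFO order; reading removes the message.
received = []
while not broker_queue.empty():
    received.append(broker_queue.get())

print(received)              # ['order-created', 'order-paid', 'order-shipped']
print(broker_queue.empty())  # True -- messages are gone after consumption
```

A second subscriber arriving now would find the queue empty, which is exactly why queueing alone does not support multiple independent readers of the same message.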
In a Message Streaming system, there is a streaming broker which acts as a distributed, append-only log. Every new message from the publisher is appended to the end of this persistent log.
In contrast to queueing, because the log is persisted, one message can be delivered to one or more consumers. Simply put, any consumer who subscribes to this log can access these messages.
In a message streaming service, the subscriber can seek to any point (offset) in the log and start reading messages from there. Apache Kafka is the most widely used open-source streaming technology; Amazon Kinesis provides a similar managed service.
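The append-only log and per-consumer offsets can be sketched in a few lines of plain Python (illustrative only; real brokers add persistence, partitioning, and replication on top of this idea):

```python
# Minimal sketch of a streaming log: messages are appended to a persistent
# log, and each subscriber tracks its own offset, so the same message can be
# read by many consumers -- and re-read from any earlier offset.
log = []  # the broker's append-only log

def publish(message):
    log.append(message)  # every new message goes to the end of the log

def read_from(offset):
    """A consumer can start reading from any offset in the log."""
    return log[offset:]

publish("sensor-reading-1")
publish("sensor-reading-2")
publish("sensor-reading-3")

print(read_from(0))  # consumer A reads everything from the beginning
print(read_from(2))  # consumer B joins later and reads only the newest message
print(read_from(0))  # consumer A can re-read: nothing was deleted on read
```

The key contrast with the queue example: reading never removes anything, so any number of consumers can process the same messages independently.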
Messaging and Streaming capabilities in SAP Data Intelligence
As we know, Apache Kafka is a distributed streaming platform. Kafka runs as a cluster on one or more servers that can span multiple data centers. The Kafka cluster stores streams of records in categories called topics. Each record consists of a key, a value, and a timestamp.
At a high level, Kafka provides the following guarantees:
- Messages sent by a producer to a particular topic partition will be appended in the order they are sent.
- A consumer instance sees records in the order they are stored in the log.
- For a topic with replication factor N, Kafka will tolerate up to (N-1) server failures without losing any records committed to the log.
The following are some key concepts and terminology in Kafka:
- Producer – A producer is a client – an application or a system – which wants to send some data. In Kafka, producers are connected and configured using the Kafka Producer API.
- Consumer – A consumer is a client – an application or a system – that wants to read the messages available in a particular topic.
- Kafka cluster – A Kafka broker is the intermediary between producers and consumers. To make a Kafka deployment more fault tolerant, multiple Kafka brokers, which can span multiple data centers, are collectively used as a Kafka cluster.
- Topics – Topics are a logical isolation of messages in Kafka; a producer or a consumer points to a particular topic in a Kafka cluster for posting or reading messages. Messages in a topic are stored in partitioned logs. We can set the replication factor of a topic to cope with server failures: a topic with a replication factor of N can tolerate up to N-1 server failures.
- Partitions – Within a partition of a topic, records are ordered, and any new record posted by the producer is appended to the end of that partition. A topic can have multiple partitions.
- Consumer groups – A consumer group is a set of consumers working on a specific task. Within a consumer group, each partition of a topic is assigned to exactly one consumer, so the number of active consumers in a consumer group cannot exceed the number of partitions.
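The interplay of keys, partitions, and consumer groups can be sketched in plain Python. This is an illustration only, not the real Kafka client: Kafka hashes keys with murmur2 and uses pluggable partition-assignment strategies, whereas this sketch uses CRC32 and a simple round-robin assignment just to show the idea.

```python
# Illustrative sketch: records with the same key land in the same partition,
# order is preserved per partition, and each partition is assigned to exactly
# one consumer within a consumer group.
import zlib

NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

def produce(key, value):
    # hash the record key to pick a partition; same key -> same partition,
    # which is what preserves per-key ordering
    p = zlib.crc32(key.encode()) % NUM_PARTITIONS
    partitions[p].append(value)

for i in range(6):
    produce(key=f"device-{i % 2}", value=f"event-{i}")

# Consumer group assignment: each partition goes to exactly one consumer in
# the group, so at most NUM_PARTITIONS consumers can be active in the group.
consumers = ["c0", "c1", "c2"]
assignment = {p: consumers[p % len(consumers)] for p in range(NUM_PARTITIONS)}
print(assignment)  # {0: 'c0', 1: 'c1', 2: 'c2'}
```

All events from `device-0` end up in one partition in the order they were produced, while a fourth consumer added to this group would simply sit idle.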
The major use cases of Kafka can be broadly grouped into three: a messaging system, a storage layer, and a stream-processing platform.
Kafka provides different APIs for these three use cases. In SAP Data Intelligence, SAP delivers Kafka Producer and Kafka Consumer operators. If we have a Kafka server in place, we can connect to the broker using Connection Management (SAP delivers a connection of type Kafka). Once this connection is established, we can use a producer or consumer operator in the SAP Data Intelligence Modeler to orchestrate our dataflow.
Using Kafka in SAP Data Intelligence
In order to use Kafka in SAP DI, we first need to create a connection to our Kafka cluster.
An example graph for a simple Kafka flow is delivered by SAP.
We can configure the Kafka Producer and Kafka Consumer operators according to our needs; the configuration options include, among others, the Kafka connection and the topic to post to or read from.
This blog post is written with the intention of providing a broader outlook on streaming and messaging. I have tried to share my understanding of Kafka and how we can use it in SAP Data Intelligence. Please leave a comment below with any questions, feedback or anything you want to add. I hope this helps you start your SAP Data Intelligence journey.