I’ve been getting this question a lot lately, especially around Kafka, but it has always come up, particularly from people who are new to “event processing”, “complex event processing” and “streaming analytics”. But first, a comment on why I put all those terms in quotes, because I think it’s relevant to this discussion.
One challenge for SAP HANA smart data streaming (SDS) is that there is no widely accepted industry term for this technology. Various labels apply, including all those I put in quotes. We tend to refer to it as event stream processing, but there is definite merit to the term Forrester uses: streaming analytics. One thing I like about “streaming analytics” is that it helps distinguish this technology from messaging technology.
So back to the original question. Specifically, what I’m getting asked a lot lately is: “How does SDS compare to Kafka?” So here goes…
First the easy bit:
Apache Kafka is a message broker. Think JMS, MQ, AMQP, etc. I won’t get into the differences between various messaging technologies here. Message queues and message buses differ, as do the types of patterns each supports. But all messaging technologies address the problem of delivering messages from producers to consumers. Producers send messages to the broker, and the broker holds them in a queue where consumers can read them (or, in the case of a bus, delivers them to all subscribers).
The key point here is that the message broker simply delivers the messages unchanged from producers to consumers. Think of the post office (though with the added ability of one-to-many distribution). Or Twitter, which is really just a message broker.
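To make the "delivers unchanged" point concrete, here is a minimal sketch of a publish/subscribe broker in plain Python. It is a toy, not Kafka and not any real broker API: the `ToyBroker` class, the `"sensor"` topic and the sample message are all made up for illustration, and it skips everything real brokers add (persistence, ordering guarantees, consumer offsets).

```python
from collections import defaultdict


class ToyBroker:
    """Toy in-memory publish/subscribe broker (illustration only, not Kafka)."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a consumer callback for a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Hand the same, unmodified message to every subscriber of the topic."""
        for callback in self._subscribers[topic]:
            callback(message)


# Two consumers subscribe to the same (hypothetical) topic.
received_a, received_b = [], []
broker = ToyBroker()
broker.subscribe("sensor", received_a.append)
broker.subscribe("sensor", received_b.append)

# One producer publishes; both consumers get exactly what was sent.
broker.publish("sensor", {"id": 7, "temp": 21.5})
```

Note that nothing in `publish` looks inside the message. That is the post office behaviour described above, with one-to-many distribution.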
HANA SDS is designed for streaming analytics. It receives messages (we tend to talk about events, but the information is delivered as messages) and applies business logic to analyze or otherwise transform those raw messages into useful information. Bottom line: in most cases, the output events (messages) from SDS are different from the input events. The focus of SDS is not on simply delivering the events but on analyzing or transforming them.
In terms of analyzing and transforming the data, just to add a bit of clarity, here are a few of the common things that SDS is used for:
- filter the incoming data to only look at data of interest. This can be simple value-based filtering, but can extend to complex, dynamic filtering logic.
- aggregate the incoming data. This can be used to change the data frequency by sampling the data, or to monitor trends, current positions, etc.
- watch for patterns of events, i.e. situation detection. This is typically used for alerting or real-time response to emerging situations. Predictive maintenance is an example, as is fraud detection.
- transform and enrich the data, getting it into the desired structure and adding context or other information to make it meaningful.
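The items in the list above can be sketched as a small pipeline. This is plain Python rather than SDS (a real SDS project would express this logic in the product's own language), and the function names, field names and thresholds are all assumptions chosen for illustration. It also assumes a single sensor to keep the example short.

```python
from collections import deque


def filter_events(events, min_temp):
    """Filter: keep only events of interest (simple value-based filter)."""
    return (e for e in events if e["temp"] >= min_temp)


def windowed_average(events, size):
    """Aggregate: emit the average temp over the last `size` events."""
    window = deque(maxlen=size)
    for e in events:
        window.append(e["temp"])
        yield {"sensor": e["sensor"], "avg_temp": sum(window) / len(window)}


def detect_rising(events, run_length):
    """Pattern: emit an alert once avg_temp has risen `run_length` times in a row."""
    previous, streak = None, 0
    for e in events:
        rising = previous is not None and e["avg_temp"] > previous
        streak = streak + 1 if rising else 0
        previous = e["avg_temp"]
        if streak >= run_length:
            # Transform: the output is a new kind of event, not the raw input.
            yield {"sensor": e["sensor"], "alert": "rising", "avg_temp": e["avg_temp"]}


# Hypothetical raw readings from one sensor.
raw = [{"sensor": "s1", "temp": t} for t in (10, 50, 52, 56, 60, 40)]

# Chain the stages: filter -> aggregate -> pattern detection.
alerts = list(detect_rising(windowed_average(filter_events(raw, 45), 2), 2))
```

Six raw readings go in and a couple of alert events come out, each with a different structure from the input. That is the contrast with a broker: the output is not the input.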
So in fact, SDS is often used in conjunction with a message broker. Inputs to SDS may flow from the producers to SDS via a message broker, and the outputs of SDS may be distributed to destinations via a message broker. In fact, I would guess that as many as half of the SDS (and ESP) deployments use messaging technology, most often as an input channel to SDS/ESP, but in some cases for distribution of output as well.
One analogy I heard long ago (apologies to the originator – I can’t remember where I heard it) is to compare the combination of Event Processing and Messaging to the central nervous system, where Messaging is the nerves, transmitting “data” from the fingertips to the brain, and the Event Processing engine is the brain, making sense of all those messages.
While it’s tempting to stop there, I do need to acknowledge that there is overlap. Yes, SDS can be used to “route” messages from producer to consumer, but the semantics are very different. We’re definitely seeing interest in using SDS to apply rules to incoming data to determine which data gets put into HANA in-memory tables, which data goes into HANA extended tables, and which data goes to Hadoop, but in most of those cases SDS is doing more to the data than just delivering it to the desired destination(s).
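As a rough sketch of what rule-based routing with "more than just delivery" might look like, here is a small Python example. The rules, the field names (`severity`, `age_days`), the 90-day cutoff and the destination labels are all hypothetical; they only stand in for the in-memory, extended-table and Hadoop targets mentioned above and do not reflect how SDS is actually configured.

```python
def route(event):
    """Pick a destination label using made-up rules (illustration only)."""
    if event["severity"] == "critical":
        return "hana_in_memory"
    if event["age_days"] <= 90:
        return "hana_extended"
    return "hadoop"


def process(event):
    """Enrich the event first, then route it: more than plain delivery."""
    enriched = dict(event, is_recent=event["age_days"] <= 90)
    return route(enriched), enriched


# A hypothetical old, low-severity event ends up in the archive tier.
destination, out = process({"id": 1, "severity": "info", "age_days": 400})
```

A broker would stop at choosing the destination. Here the event is also changed on the way through, which is the difference in semantics.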
Bottom line: while there’s some overlap, they were designed for different purposes. So in general, use the best tool for the job. The technology world is full of overlapping technologies, and that’s generally the best rule to follow. And in many cases, use them together.