Next Generation RealTime Data Integration - Archit...

werner_daehn · ‎02-11-2020

Why are there so many Data Integration solutions? Even SAP alone has:

Data Services

System Landscape Transformation (SLT)

Hana Smart Data Integration

Hana Smart Data Access

Cloud Platform Integration (CPI-DS, CPI-PI)

Business Hub

SAP (Sybase) Replication Server

Data Hub

Open Connectors from Cloud Elements

Process Orchestration

....and many more

To help the customer pick the solution with the proper capabilities, a Solution Advisor Tool (ISA-M) was rolled out.

(screenshot taken from above ISA-M link)

In contrast, the customers have simple requirements: Get the data from one system into a second easily and with low latency.

It is as if the customer wants to plug a power tool into a socket, and the vendor responds by explaining the different types of power plants, the electric grid technologies, and which socket supports this particular device.

Question: What is the fundamental ...

difference between Process Integration, Data Integration, IoT Integration and User Workflows?

difference between a system running on-premises and one running in the cloud?

difference between calling APIs to interact with a service broker, with an App or to access Data?

If there is no real difference, the million-dollar question is: does it have to be that way? A few years back I would have said yes, unfortunately, because there were inhibiting technical factors. But with the concepts and tools invented for the Big Data world, that is no longer the case.

Instead of having multiple tools with different and incompatible capabilities, a single solution can finally solve it all.

The start of the story

Question: How can a customer get data from the SAP ERP system?

As said initially, the customer requirement was to have a simple and low latency (Realtime) integration. A comfortable option would be to ask the ERP system: Which sales orders have been changed within the last 10 seconds?

From a technical point of view this sentence unveils two unsolved problems:

What has changed? In the ERP system there is no common way to identify changes. Every ERP module uses a different method, and for most data elements there is no change indicator at all. The same is true for most other systems, SAP and non-SAP.

What comprises a sales order? The ERP system has many tables but does not expose what a Sales Order consists of. There are traces of that information in CDS Views or IDocs, but that's it.

To satisfy the customer requirements, an addition to the ERP system is needed that tracks the changes and can assemble a business entity (the Sales Order in this example). This component is called the SAP ERP Producer. A demo instance runs here.
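The two tasks of such a producer, tracking changes and assembling a business entity, can be sketched roughly as follows. This is only an illustration, not the actual SAP ERP Producer: the table layouts, the `CHANGE_LOG` structure and both function names are invented for this example, and a real ERP system would need a mechanism to fill such a change log in the first place.

```python
# Hypothetical, heavily simplified stand-ins for ERP tables and for a change
# log that the producer would have to maintain itself. All names are invented.
ORDER_HEADERS = {
    "4711": {"customer": "ACME"},
    "4712": {"customer": "Globex"},
}
ORDER_ITEMS = [
    {"order_id": "4711", "item_no": 10, "material": "BOLT", "qty": 5},
    {"order_id": "4712", "item_no": 10, "material": "NUT", "qty": 100},
    {"order_id": "4712", "item_no": 20, "material": "WASHER", "qty": 50},
]
# (timestamp in seconds, order_id) entries, oldest first.
CHANGE_LOG = [(100, "4711"), (205, "4712"), (208, "4712")]


def assemble_sales_order(order_id):
    """Join header and line items into one nested business entity."""
    items = [
        {key: value for key, value in row.items() if key != "order_id"}
        for row in ORDER_ITEMS
        if row["order_id"] == order_id
    ]
    return {"order_id": order_id, **ORDER_HEADERS[order_id], "items": items}


def changed_orders(since):
    """Return each order changed after `since` once, fully assembled."""
    changed_ids = []
    for timestamp, order_id in CHANGE_LOG:
        if timestamp > since and order_id not in changed_ids:
            changed_ids.append(order_id)
    return [assemble_sales_order(order_id) for order_id in changed_ids]


recent = changed_orders(since=200)
print(recent)
```

The consumer asks one question ("what changed since second 200?") and gets complete Sales Orders back, without knowing which tables are involved.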

Optimization for Multipoint data integration

Question: In today's world, how many systems would like to consume SAP ERP data, one or many?

In the past the assumed answer was a single one: SAP ERP to SAP BW, or SAP ERP to SAP ByDesign. That was not true even then, and nowadays it is worse. Many cloud apps need the data, and even cloud solutions provided by SAP need ERP data. SuccessFactors needs HR-relevant data, and Hybris, Concur, Ariba, plus all the other systems customers have in use, require SAP ERP data.

If all of them start asking the ERP system "did something change?" the server will be busy capturing and sending the changes for every single consumer. We can do better.

Therefore the architecture has an intermediary: a system the SAP ERP Producer pushes the change data to, which stores the changes and from which all consumers can fetch them. That component is Apache Kafka. The Producer described above writes into this Kafka instance.
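Why the intermediary relieves the source system can be shown with a small in-memory model. This is a sketch of the idea only and does not use the Kafka client API; `ChangeLog` and `Consumer` are hypothetical classes invented for the illustration.

```python
class ChangeLog:
    """Append-only store of change records, a stand-in for a Kafka topic."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read(self, offset):
        """Return all records from `offset` onward without removing them."""
        return self._records[offset:]


class Consumer:
    """Each consumer keeps its own read position in the shared log."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch


log = ChangeLog()
bw = Consumer("bw", log)
crm = Consumer("crm", log)

# The producer captures and writes every change exactly once ...
for order_id in ("4711", "4712", "4713"):
    log.append({"order_id": order_id})

# ... and every consumer reads the same changes at its own pace.
bw_first = bw.poll()
crm_first = crm.poll()

log.append({"order_id": "4714"})
bw_second = bw.poll()  # crm has not polled again and simply lags behind
```

The cost on the ERP side stays the same whether two or twenty consumers exist, because the ERP system only ever talks to the log.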

Transactional

Another implicit requirement of realtime is transactional consistency. In the past, data has been processed per table. SAP BW is one example: first all Order Line Items are loaded, then all Order Headers. As a result the data is not consistent, e.g. the newest sales order got loaded but has zero line items, because the line items were read a few minutes earlier, before that order existed.

This is no problem as long as the data is read together and has a clear dependency. Achieving that is tougher than it sounds, but okay. However, it would be better to load the data in the same order as it was committed in the source. Then the data is always consistent, automatically.

The user of the ERP system created a sales order with multiple items and saved the data? The producer will create a sales order with the line items.

The user changed the sales order a minute later? A change of the sales order will be produced.

Apache Kafka allows consumers to read the entire stream of changes in the proper order, including the transactional bracket of the commits.
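The effect of applying changes in commit order, one source transaction at a time, can be illustrated with a few lines of code. The record layout with a `tx` field and the function name are assumptions made for this sketch; it models the principle and is not how Kafka marks transactions internally.

```python
from itertools import groupby

# Change records in the order they were committed in the source. The "tx"
# field is a hypothetical transaction ID shared by all rows of one commit.
stream = [
    {"tx": 1, "table": "HEADER", "order_id": "4711"},
    {"tx": 1, "table": "ITEM", "order_id": "4711", "item_no": 10},
    {"tx": 1, "table": "ITEM", "order_id": "4711", "item_no": 20},
    {"tx": 2, "table": "ITEM", "order_id": "4711", "item_no": 30},
]


def apply_in_commit_order(records):
    """Apply whole transactions and record the target state after each one."""
    target = {"HEADER": [], "ITEM": []}
    snapshots = []
    # groupby only groups consecutive records, which is what we want here:
    # the stream is already sorted by commit order.
    for tx, tx_records in groupby(records, key=lambda record: record["tx"]):
        for record in tx_records:
            target[record["table"]].append(record)
        snapshots.append((tx, len(target["HEADER"]), len(target["ITEM"])))
    return snapshots


snapshots = apply_in_commit_order(stream)
print(snapshots)
```

After every transaction the target holds a header together with its line items; the state "header without items" from the per-table load never becomes visible.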

Business Entity Schema changes

Question: What should happen when the source schema changed, e.g. a new column was added to a table?

For an ETL tool like SAP Data Services or SAP Data Hub the answer is simple: the developer needs to be told, then modifies the target tables and the data flows, tests everything and finally moves the code to production. Depending on the situation, an initial load of the impacted tables will be necessary the following weekend.

This situation was good enough in the old days, when data was static and did not change often. Not ideal, but good enough. Hence no vendor invested in any improvements in that area.

The Big Data world introduced the term Schema Evolution for this problem and defined rules for how structure changes can be dealt with. One rule, for the most common case of adding a new column: the schema can be expanded by this additional column as long as the column has a default value. But there are other schema evolution options as well, such as data type changes and aliasing, to deal with as many situations as possible automatically.

The clever part is how producers and consumers work together. If the producer finds an additional column, it expands the schema and stores it as a new version in the Kafka Schema Registry. Then the change record is produced, and the schema version ID is part of the payload. The consumer receives the payload, finds a new version ID, loads the corresponding schema definition and can process the data. Another producer might still produce data for the old version, but as the consumer's schema has a default value for that column (a null in most cases), the consumers are not impacted by that.

In essence, the schema is no longer a fixed structure but can change in a backward compatible way at any time. What each consumer does with the additional field is up to the developer. A data lake writer will simply add this field to the target table. A target application, which does not even need that additional field, will ignore it.
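The interplay of schema versions and default values can be sketched with a toy registry. This is an in-memory illustration only: the `SchemaRegistry` class, the `decode` function and the message layout are invented here and do not reflect the real Schema Registry API or the Avro wire format. For brevity the sketch only supports adding fields.

```python
class SchemaRegistry:
    """In-memory stand-in for a schema registry (not the real API)."""

    def __init__(self):
        self._versions = []  # list index + 1 is the schema version ID

    def register(self, fields):
        """Store `fields` (field name -> default value) as a new version."""
        if self._versions:
            removed = set(self._versions[-1]) - set(fields)
            if removed:
                raise ValueError(f"sketch only supports adding fields: {sorted(removed)}")
        self._versions.append(dict(fields))
        return len(self._versions)

    def get(self, version_id):
        return self._versions[version_id - 1]


def decode(message, reader_fields):
    """Project a message onto the consumer's schema, filling in defaults."""
    data = message["data"]
    return {name: data.get(name, default) for name, default in reader_fields.items()}


registry = SchemaRegistry()
v1 = registry.register({"order_id": None, "customer": None})
# The producer found a new column and registers an expanded schema.
v2 = registry.register({"order_id": None, "customer": None, "currency": None})

old_message = {"schema_id": v1, "data": {"order_id": "4711", "customer": "ACME"}}
new_message = {
    "schema_id": v2,
    "data": {"order_id": "4712", "customer": "Globex", "currency": "EUR"},
}

# A consumer that already loaded version 2 can still read version 1 data.
upgraded = [decode(message, registry.get(v2)) for message in (old_message, new_message)]
# A consumer that only needs the version 1 fields ignores the new column.
legacy = decode(new_message, registry.get(v1))
```

Neither consumer had to be redeployed for the new column: one picks it up with a default for older records, the other never sees it.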

Publish/Subscribe versus Produce/Consume

Question: Who should have control over the received data, the producer or the consumer?

That should be obvious. As a consumer I want to decide which data I want to consume. The point "Which data" has two meanings in this context, though: The type of data is one but the time aspect is another.

In the old publish/subscribe methodology, the consumer listens on ("subscribes to") a queue and therefore receives all data posted there from that point in time onward. The Produce/Consume method is more like "tell me which sales orders have been changed in the last 10 seconds". The consumer can listen on a data topic, but it can also read the topic at will. Although the two sound almost the same, this difference is the core reason Publish/Subscribe Message Queuing is not broadly used.

For example, if a Subscriber processed data incorrectly, it has no way to re-read the same data. A Kafka Consumer simply rewinds its pointer to an earlier point in time and reads the data again. A developer needs to read the same data multiple times during testing. A Data Warehouse like BW loads data just once a day to have stable data anyhow. Error handling, recovery, ... everything is far simpler in a Produce/Consume pattern. That is why Apache Kafka, as the champion in that area, is such a hot topic at companies.
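The difference between the two patterns comes down to whether reading removes the data. A minimal sketch, using plain Python containers instead of a real message broker or Kafka client; `drain` and `LogReader` are hypothetical names for this illustration. Kafka's consumer API offers a comparable seek operation on real topics.

```python
from collections import deque

messages = ["order-1", "order-2", "order-3"]

# Classic queue: delivering a message removes it.
queue = deque(messages)


def drain(q):
    """Consume everything currently in the queue."""
    received = []
    while q:
        received.append(q.popleft())
    return received


first_delivery = drain(queue)
second_delivery = drain(queue)  # nothing left to re-read after a mistake

# Log with offsets: reading leaves the data in place.
log = list(messages)


class LogReader:
    def __init__(self, records):
        self.records = records
        self.offset = 0

    def poll(self):
        batch = self.records[self.offset:]
        self.offset = len(self.records)
        return batch

    def seek(self, offset):
        """Rewind (or fast-forward) the read position."""
        self.offset = offset


reader = LogReader(log)
first_read = reader.poll()
reader.seek(1)  # processing of order-2 was faulty, so go back
replayed = reader.poll()
```

Recovery is a single `seek` on the consumer side; the producer is not involved and does not need to resend anything.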

The grand finale

Coming back to the initial question: What is the difference between process integration and data integration? The data structures. In process integration the data structures are complex. They consist of deeply nested data with arrays and substructures. In data integration all data comes as flat relational tables. Process integration is slow, data integration is fast.

Apache Kafka breaks the boundary. It is fast and can process deeply nested structures. For Kafka, data integration is just an especially trivial case with no nested elements.

Hence it is the perfect backbone for any kind of integration. What is missing are the business user friendly producers/consumers - which this project provides.