One of the first things to take into account when figuring out the best approach to change data capture is identifying how the data is going to be delivered to the Data Services engine. Do you have to go and fetch the data, or is it dropped in Data Services' lap? From a technology point of view, pushed data usually travels via a message bus, API or web service, whereas with a pull Data Services connects to the source systems and extracts the data itself.
Typically, though not always, if a customer is using the push method, Data Services will be running real-time jobs that listen for messages arriving on a queue. In this scenario customers usually send already-identified changes (inserts/updates/deletes) along the message bus as XML messages. (One example of this would be real-time SAP IDocs over ALE.)
Most of the hard work is already done for the Data Services developer here: the changes are pre-identified and delivered, and all that remains is deciding how to apply the delta load to the target, perhaps adding a few data quality checks and transformations along the way! This method is how a real-time data requirement can be fulfilled using Data Services, and it is something more customers are looking at to achieve near real-time data integration.
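To illustrate why the push pattern leaves so little work for the developer, here is a minimal Python sketch of applying pre-identified changes to a target. The XML layout, the `op` flags and the `apply_delta` helper are all hypothetical, invented for illustration; a real IDoc or message-bus payload would look quite different, but the idea is the same: the operation is already labelled, so the consumer only has to apply it.

```python
import xml.etree.ElementTree as ET

# Hypothetical delta message: each row arrives with its change
# already identified via an "op" flag (I=insert, U=update, D=delete).
MESSAGE = """
<delta>
  <row op="I"><id>1</id><name>Alice</name></row>
  <row op="U"><id>2</id><name>Bob</name></row>
  <row op="D"><id>3</id></row>
</delta>
"""

def apply_delta(target, xml_text):
    """Apply pre-identified changes to a simple dict keyed by id."""
    for row in ET.fromstring(xml_text):
        rid = row.findtext("id")
        if row.get("op") == "D":
            target.pop(rid, None)               # delete
        else:
            target[rid] = row.findtext("name")  # insert or update
    return target

target = {"2": "Robert", "3": "Carol"}
apply_delta(target, MESSAGE)
# target is now {"1": "Alice", "2": "Bob"}
```

No comparison against the target was needed to work out what changed; that is the key saving of the push approach.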
This is the more traditional method of acquiring data, and one that every ETL developer is familiar with: Data Services instigates the extraction process from the source system. This is typically a scheduled job set to run at an interval. These intervals can vary from a nightly extraction to a batch process that runs at regular intervals, such as the micro-batch in the latest version of Data Services. The Data Services micro-batch keeps polling the source data until certain conditions are met; this delivers a finer level of time-based granularity and simplifies the configuration of this approach. Micro-batching can be really useful when customers want to trickle-feed data to a target but cannot switch on CDC functionality in the source system: micro-batching is platform independent, whereas CDC in Data Services is specific to the type of database being sourced.
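The polling behaviour described above can be sketched in a few lines of Python. This is not how Data Services implements its micro-batch internally; it is just an assumed shape for the pattern, with hypothetical `fetch_new_rows` and `process` callables standing in for the source extraction and the dataflow.

```python
import time

def micro_batch(fetch_new_rows, process, interval_secs=30, max_batches=None):
    """Poll a source at a fixed interval and process whatever arrived.

    fetch_new_rows -- callable returning rows added since the last poll
    process        -- callable applied to each non-empty batch
    max_batches    -- stop condition for the loop (None = run forever)
    """
    batches = 0
    while max_batches is None or batches < max_batches:
        rows = fetch_new_rows()
        if rows:            # only run the load when something new arrived
            process(rows)
        batches += 1
        time.sleep(interval_secs)
```

The stop condition here is a simple batch count; in practice it would be whatever "certain conditions" the job defines, such as a time window or an end-of-data marker.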
When using a pull method, the logic to identify changes is more often than not built using standard functions within Data Services, such as the Table Comparison transform or CDC features. Other options include polling directories for change files or working with Sybase Replication Server.
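The contrast with the push method is that here the changes must be worked out. A minimal Python sketch of the idea behind a table comparison, assuming simple in-memory rows keyed by an `id` column (the function name and signature are illustrative, not the Data Services transform's API):

```python
def table_compare(source, target, key):
    """Classify source rows against target rows as inserts, updates
    or deletes, mirroring the idea behind a table comparison step."""
    tgt = {row[key]: row for row in target}
    ops, seen = [], set()
    for row in source:
        k = row[key]
        seen.add(k)
        if k not in tgt:
            ops.append(("I", row))   # row not in target: insert
        elif row != tgt[k]:
            ops.append(("U", row))   # row differs: update
    for k, row in tgt.items():
        if k not in seen:
            ops.append(("D", row))   # row gone from source: delete
    return ops
```

Note that this requires reading both source and target in full, which is exactly why the pull approach suits scheduled batch windows better than true real-time delivery.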
So to recap, customers would consider using the push method with Data Services when:
- They require data to be moved and processed in real time. An example would be real-time address confirmation and cleansing from an e-commerce web portal via a web service request to the Data Services engine.
- They have a message queue or SOA mechanism for data delivery.
- They need real-time interactions with the data.
- Data is typically being processed from a single primary source (reaching out to other systems could introduce unwanted latency into the real-time process).
Pull methods are used where:
- Data latency is less of an issue and the data is processed at different intervals, possibly via a schedule.
- The extracts are driven from Data Services through a batch process.
- High volumes of bulk data are being moved.
- Data is integrated from many sources providing complex cross system integration.
- Change data identification needs to happen within Data Services.
That said, there is no technical limitation on the number of sources and targets that can be used within a single batch or real-time dataflow; this is more of a general rule of thumb. Something to keep in mind is that Data Services' primary focus is data integration, not process integration between different applications within a business architecture. That is where a customer would adopt SAP Process Orchestration (PO).
In the next post I will start delving into Source versus Target based change data capture.