My original plan for this post was to wrap up source-based and target-based change data capture in a single instalment; unfortunately, I got carried away, so I will cover target-based CDC in a follow-up post.
Once the method of data delivery has been established (push or pull), the next area of consideration is how change data capture (CDC) can be applied within Data Services. More often than not, when a project demands that data be extracted many times from the same source system using Data Services, the topic of change data capture will come up. As the name suggests, CDC is all about identifying what has changed in the source system since the data was previously extracted, and then pulling only the new or modified records in the next extraction process. The net result is that effective CDC enables users to build efficient data processing routines within Data Services, reducing batch windows and the overall processing power required.
Source Based CDC
With source-based change data capture, any record changes are identified at the source, and only those records are processed by the Data Services engine. These changes can be pre-identified using techniques such as SAP IDocs or Sybase Replication Server (now integrated in the latest version of Data Services), or dynamically identified using pushdown logic, timestamps and so on.
There are various methods available with Data Services, and I have used a number of these across many different scenarios. With Data Services there are nearly always two or more ways to achieve the required result, and the choice often comes down to the ETL developer's level of creativity and caffeine on the day. The following are just rules of thumb that I use; they don't work in all scenarios, as there are usually many variables to take into consideration, but these are the stages I go through when trying to identify the best method for change data capture.
So the customer has said that they want to do CDC. The first questions I always ask are:
What is the source database/application?
How much data is in the source system tables we are extracting from?
How often do we need to extract, and what is the batch window?
- If the source table has low data volumes and the batch window is large, then I will usually go for the easiest path, especially in a proof of concept: reading all of the data every time and applying Auto Correct Load in Data Services to carry out the updates and inserts.
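A minimal sketch of this "read everything, Auto Correct Load" pattern: every run extracts the full source table and upserts each row into the target, keyed on the primary key. The sample rows and the key name `id` are invented for illustration; in Data Services itself this behaviour is simply the Auto Correct Load option on the target table, with no hand-written code.

```python
def auto_correct_load(target, source_rows, key="id"):
    """Upsert each extracted row: insert if the key is new, update if it exists."""
    for row in source_rows:
        target[row[key]] = dict(row)
    return target

# The target already holds one row; a fresh full extract updates it and adds one.
target = {1: {"id": 1, "name": "old"}}
extract = [{"id": 1, "name": "new"}, {"id": 2, "name": "added"}]
auto_correct_load(target, extract)
```

The trade-off is exactly as described above: simple and robust, but every run pays the cost of reading the whole source table, which is only acceptable with low volumes and a generous batch window.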
- If the source data changes often and is of high volume, but there is a reasonable overnight batch window, I would typically ask whether the source table I am extracting from has a valid timestamp. A valid, trustworthy timestamp is key; some systems, for example, set the timestamp only on insert and never on update. If a reliable timestamp were available, I would consider timestamp pushdown in Data Services. Timestamp pushdown requires a number of job steps to be configured:
- Firstly, a global variable to hold the "last run date time" would need to be defined for the job.
- A database table would need to be created with a date-time field and a JOBID field.
- The last-run date-time variable would be populated using a script at the start of the job workflow, reading the value from the lookup table.
- Within the Query transform, you would then set a where clause of the form (source datetime field > last run date variable).
- Check that the where clause is being pushed down by viewing the generated SQL or ABAP.
- The last step, back at the workflow level, is to use either a script or a dataflow (I prefer a dataflow) to update the lookup table with the new date-time value.
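The steps above can be sketched end to end in plain SQL, here driven from Python with SQLite. The table and column names (JOB_CONTROL, LAST_RUN, SRC_ORDERS, CHANGE_TS) are assumptions for illustration only; in a real job the select would run against the source database, with Data Services pushing the where clause down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE JOB_CONTROL (JOBID INTEGER PRIMARY KEY, LAST_RUN TEXT);
    CREATE TABLE SRC_ORDERS (ID INTEGER, CHANGE_TS TEXT);
    INSERT INTO JOB_CONTROL VALUES (1, '2024-01-01 00:00:00');
    INSERT INTO SRC_ORDERS VALUES
        (1, '2023-12-31 23:00:00'),   -- already extracted in the last run
        (2, '2024-01-02 08:30:00');   -- changed since the last run
""")

# Step 1: read the "last run date time" into a variable (the global variable).
last_run = conn.execute(
    "SELECT LAST_RUN FROM JOB_CONTROL WHERE JOBID = 1").fetchone()[0]

# Step 2: the where clause comparing the source timestamp to the variable;
# this is the predicate that Data Services pushes down to the database.
changed = conn.execute(
    "SELECT ID FROM SRC_ORDERS WHERE CHANGE_TS > ?", (last_run,)).fetchall()

# Step 3: after processing, move the high-water mark forward in the lookup table.
conn.execute("UPDATE JOB_CONTROL SET LAST_RUN = ? WHERE JOBID = 1",
             ('2024-01-02 08:30:00',))
```

Only the row changed after the stored date-time is returned, which is the whole point: the database does the filtering, not the ETL engine.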
In the latest version of Data Services (4.2), the timestamp steps above can be configured within the Workbench as part of the replication wizard. If the source system is SAP, I would also look at using the CDC functions available within the content extractors, as this is preconfigured functionality and doesn't require any of the above job configuration steps.
If data needs to be extracted at various points throughout the day, then the pushdown method could still be an option; however, I am always very cautious about impacting performance on the source systems, and if there is a chance that performance degradation will affect business transactions, I would opt for a different approach where possible.
- If the source system changes regularly, has high data volumes, the data needs to be transferred intraday, and the extract should have little or no impact, I would look at either using IDocs for SAP or using the database-native CDC mechanisms supported by Data Services. Configuration of these methods is fully documented in the Data Services manuals, but typically they require the customer to have certain database functions enabled, which is not always possible. Also, depending on the database type, a slightly different mechanism is used to identify changes; this has in the past limited when I have been able to take advantage of this approach. In the latest version of Data Services (4.2), configuring database CDC is made easier: it can be done through the wizard within the Workbench, or users can simply define a CDC method in their datastore configuration. If the option is greyed out, then that datastore type does not support native application CDC.
- If the source data changes frequently, needs to be processed nearly instantaneously, and the extract must have little or no impact on the source systems, I would consider a log-interrogation-based approach or a message queue in which changes are pre-identified within the messages (e.g. IDoc/ALE). For non-invasive, log-based CDC, Sybase Replication Server working with Data Services enables changes to be identified from the database's native logs, flagged with their status (insert/update) and shipped to Data Services for processing. If this real-time, non-invasive approach to data movement is key to a project, then I would recommend complementing Data Services with Sybase Replication Server.
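To illustrate what the consumer side of this looks like, here is a hedged sketch of applying pre-flagged change records: each record arrives already marked as an insert, update or delete, so only changed rows are ever touched. The 'I'/'U'/'D' operation codes and the key name are assumptions for the example, not the actual Replication Server message format.

```python
def apply_flagged_changes(target, changes, key="id"):
    """Apply a stream of (operation, row) pairs to a keyed target."""
    for op, row in changes:
        if op in ("I", "U"):           # insert or update: upsert the row
            target[row[key]] = dict(row)
        elif op == "D":                # delete: drop the row if present
            target.pop(row[key], None)
    return target

# One existing row; the change stream updates it, adds another, then deletes it.
target = {10: {"id": 10, "qty": 5}}
stream = [("U", {"id": 10, "qty": 7}),
          ("I", {"id": 11, "qty": 1}),
          ("D", {"id": 10})]
apply_flagged_changes(target, stream)
```

Because the change identification has already happened in the database logs, the processing cost is proportional to the number of changes, not the size of the source table.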
When working on site with customer data, the source systems and infrastructure will nearly always determine which methods of change data capture can be put to best advantage with Data Services. However, given free rein, the best and most efficient approach is without doubt to carry out the change identification process as close to the source system as possible; that isn't always an available option, though. In the next blog post I will dig a little deeper into target-based change data capture.