Hana Smart Data Integration - The magic behind Ada...

werner_daehn · ‎09-30-2015

When writing Hana Adapters supporting Realtime, one of the parameters to be sent along with each row is its Opcode: Insert/Update/Delete/... Hana then does receive such row and does insert/update/delete the record. Not much magic there is, is there?

For normal source databases and simple 1:1 replication of tables, indeed there is not much more needed. Hence these are the two areas requiring more than just that.

Upsert

The RSS Adapter is a first example where more than that is needed. An RSS feed is nothing else than a web page read frequently where the web server lists all recent changes in a standardized (XML) format. Each page has an unique URL - could be seen as primary key - and therefore the adapter knows what the 10(?) most recent added webpages had been. If each of these 10 rows would get an Opcode of Insert, the first refresh would work, the second for sure would get a primary key violation for some rows at least. It would not even help to remember the last call and the fetched pages there as the adapter might have been restarted.

Hence one solution would be for the adapter to compare if such row exists already and if it has the same contents. And when there are transformations applied to the data? Does not make sense.

So there is another Opcode to be used for those sources where the adapter has no knowledge if that changed row was processed already, is brand new or got changed - the Opcode Upsert. For such rows the source table requires a primary key.

Update

Speaking of primary key, an update row is actually two rows produced by the adapter, the row to be updated (before image values) and its new values (after image values). Only with both pieces of information the update does work correctly.

Example: In the source there is a customer record with the customer id = 1234 and name = Jack, this row gets updated to customer id = 9999 and name = Jil. Without the information it is row with customer id = 1234 to be updated, the applier cannot create the correct update statement.

update table set customer_id=9999, name='Jil' where customer_id=1234 and name = 'Jack';

Which again leads to the question, what if the old values are unknown?

Most sources do not provide access to both, the previous values and the new values. As long as the source table has a primary key, no problem to send the Update row only and no Before row.

Delete

The Delete requires the old values as well to delete the old row like in

delete table where customer_id=1234 and name = 'Jack';

If the table has a primary and only the key to be deleted is known, an eXterminate row should be sent instead.

Exterminate

As said, the Exterminate is a delete with the PK columns being set only, all other column values are null. The reason this is a second Opcode will be seen shortly when talking about transformations.

Truncate

Sometimes it is required to delete an entire set of rows with a single command. An obvious example would be a truncate-table command in the source. Then the adapter does not know the individual rows to be deleted but the range, in this example all rows. Hence the adapter would send a row with Opcode Truncate and all columns being null.

Another example would be a source command like alter table drop/truncate partition, e.g. the partition with the column YEAR = 2010. In order to replicate such source change, the adapter has to send a Truncate row with all columns being null except the YEAR column, this has a value of 2010.

Replace

Another delta strategy could be to truncate a range of records and then insert the same rows again. As there is a dependency between the Truncate and the Insert rows, the adapter should create a truncate row to remove the contents and then send Replace rows with the new values.

Such a delta approach is especially useful if the adapter knows something changed but not all, especially not deletes.

Example: The source tells that there was a change in the SalesOrder ORDER=1234. One line item got added ITEM=4, another updated ITEM=3, a third deleted ITEM=1. But this is unknown to the adapter. Hence the adapter would send a Truncate row with ORDER=1234 and then send all currently found rows in the source, so ITEM = 2,3,4.

Opcodes and Transforms

The main issue we faced during development of realtime transform was to process the change rows and their opcodes properly.

Example: Source is a customer master table with the primary key CUSTOMER_ID. This source data goes through a filter transform, taking only the US customers, feed them through an Address Cleanse transform and load a target table that has a surrogate key as primary key.

Inserts are simple, they flow through all transforms, a surrogate key value is assigned from a sequence and inserted into the target.

Updates and Deletes are also flowing through all transforms. The only requirement for them is that the source primary key columns exists in the target as fields as those are used as the where clause of the update/delete.

Exterminate and Truncate rows are a problem as these rows do not have values on all columns, hence the transformations could produce totally different results. The filter transform has a condition REGION='US' but for those rows the REGION is null always,hence the rows do not make it through the transforms and are never applied to the target, which would be wrong. Therefore these rows are processed out-of-band. The first thing the flowgraph does is scanning the input for these rows. All of these rows are then applied in the target first, for example the Exterminate row for CUSTOMER_ID=4567 causes a delete where CSUTOMER_ID=4567 in the target table. For these opcodes the target table has to have the source primary key columns loaded unchanged and all the columns used as truncate scopes.

Replace rows are like Insert rows.

Opcode and TableComparison

If the flowgraph has a TableComparison transform, above out-of-band processing is rather than in the TC transform, not the loader itself. Same with Exterminate rows.

The task of the TC transform is simply to take the input and to create a dataset with Insert/Update/Delete Opcodes only. So even a Truncate+Replace is turned into a dataset with Insert/Update/Delete and some rows being discarded in case they did not change at all.