When loading data from Kafka into Hana Cloud, there are two fundamental questions to answer:
- What to do with nested data
- What to do when the structure changes
The SAP Kafka Connect is built for the one extreme, where Kafka contains changes on table level only – no nested data – and there are no structure changes. What was missing is the other extreme.
A Kafka Sink which can load any Kafka message and stores this nested data in a relational model – the Hana Cloud Loader (this github page contains all about installation, docker container location, config, etc).
Nested Model to Relational Model
If the message is just relational, a single table is enough. For nested objects, the connector creates a separate table for every array field in the Avro schema and the sub-table contains in addition the master table’s primary keys for reference.
In this example the Customer schema has a sub schema containing multiple CompanyAddress lines, thus the tables Customer and Customer_CompanyAddress are created.
The Customer schema has also a field “_audit” but that is just a sub record. Thus the Customer table contains the data of this. But this sub-schema’s “details” fields is an array. Its data is stored in the Hana table “Customer__audit_details”.
The array field “_extension” exists in the Customer and CustomerAddress level. One table Customer__extension contains all data. The columns _RECORD_PATH, _PARENT_RECORD_PATH and FIELD contain the information where in the schema this record was stored.
If new fields are added to the schema, the target tables get extended automatically.
Hana Cloud Data Lake
This approach is exactly what is required if the Hana Cloud is used like a Data Lake, where all data should be stored, no source information should ever be lost.
Further more, each table can be partitioned and the partitions assigned to a data temperature.
In above example, the millions of Customer and Address records are used constantly, the information about the business rules and transformations each record did undergo is 30 times more and infrequently used. These tables are assigned to a warm storage – the Data Lake storage level of Hana Cloud.
One question left are the data types being used when the tables are created. In Avro there are just seven base data types, not even a decimal exists as primitive. Avro supports Logical Data Types to add some more like the decimal. But that is far from what Hana supports.
Of course every Avro data type can be mapped to a default Hana datatype, e.g. an unbounded String gets a NVARCHAR(5000) column (because NCLOBs are slow). It would be better to use a proper data type like NVARCHAR(10), VARCHAR(10), VARBINARY(10) or NCLOB. Thus the Hana Loader needs more information from the schema.
By annotating the Kafka Schema with the exact type, this can be accomplished easily.
In order to keep up with Kafka, the Hana Loader must be implemented with all performance tricks the database interface allows for. The data is loaded via prepared statements and with batching multiple changes on the same table into a single request. This thus reduce the network round trips significantly, a core requirement when interacting with a cloud system.
This post is part of a series and its full power is unlocked when combining it with one of the other components.