|based on SAP HANA rev. 81
Just shortly before my vacation starts I thought I leave you with another pearl of knowledge… *cough*cough*
Don’t expect suspense or a proper three act structure – it’s just one of those techie blogs that you might put on your “read later” list and then forget about it…
Anyway, that’s how I tell this story:
Last week a colleague reached out to me and presented the following case:
“We have a data warehouse system with fact tables and master data tables.
Between fact and master data tables, foreign key constraints have been set up to ensure data consistency.
Now, whenever we load master data, the transaction (sic!) tables grow in size, while the number of records stays the same.“
What could be going on here?
Quantum entanglement effects on SAP HANA column store tables?
(and this really is all I come up with to rectify the super-cheesy title… 🙂 )
When I read this, I first thought this likely was a misobservation.
But, alas, sometimes you just have to try things out.
And so I did this:
1. Setup the tables
CREATE COLUMN TABLE masterdata (id INTEGER PRIMARY KEY, md_data NVARCHAR(20));
CREATE COLUMN TABLE transactions (id INTEGER PRIMARY KEY, data NVARCHAR(20)
, md_id INTEGER
, FOREIGN KEY (md_id) REFERENCES masterdata ON UPDATE CASCADE);
2. Load some dummy data
— load some masterdata
INSERT INTO masterdata VALUES (1, ‘MD1’);
INSERT INTO masterdata VALUES (2, ‘MD2’);
INSERT INTO masterdata VALUES (3, ‘MD3’);
— load some transactions
insert into transactions values (1, ‘TX1’, 1);
insert into transactions values (2, ‘TX2’, 2);
insert into transactions values (3, ‘TX3’, 3);
insert into transactions values (4, ‘TX4’, 1);
— do some storage cleanup
UPDATE masterdata WITH PARAMETERS (‘OPTIMIZE_COMPRESSION’ = ‘FORCE’);
UPDATE transactions WITH PARAMETERS (‘OPTIMIZE_COMPRESSION’ = ‘FORCE’);
MERGE DELTA OF masterdata WITH PARAMETERS (‘FORCED_MERGE’=’ON’);
MERGE DELTA OF transactions WITH PARAMETERS (‘FORCED_MERGE’=’ON’);
3. Check the table storage
SELECT table_name, memory_size_in_total,
table_name IN (‘MASTERDATA’, ‘TRANSACTIONS’)
ORDER BY table_name;
4. Check the column storage for TRANSACTIONS table
SELECT column_name, count, distinct_count
ORDER BY column_name;
So, up to here everything is normal and as expected.
Now, we want to load some new master data.
A common approach is to run a full update and that’s what I will do here as well.
To make things a little more handy, I set up a second table with our new master data, called MD_STAGING.
It contains the same records that are already present in table MASTERDATA, except for one updated record, plus two “new” records.
CREATE COLUMN TABLE md_staging (id INTEGER PRIMARY KEY, md_data NVARCHAR(20));
INSERT INTO md_staging VALUES (1, ‘MD1’);
INSERT INTO md_staging VALUES (2, ‘MD2’);
INSERT INTO md_staging VALUES (3, ‘MD3_NEW’);
— the “new” data
INSERT INTO md_staging VALUES (4, ‘MD4’);
INSERT INTO md_staging VALUES (5, ‘MD5’);
5. Now let’s “load” the new data
Loading the new master data basically consists of two steps:
- INSERT any actually new records and
- UPDATE the ones that we already have with the current data.
A well known ETL software (Data Services and Data Quality) would probably do something similar to this:
UPDATE masterdata SET id = new.id,
md_data = new.md_data
WHERE masterdata.id = new.id;
INSERT INTO masterdata
(SELECT is, md_data FROM md_staging
WHERE id NOT IN (SELECT id FROM MASTERDATA));
So, let’s do this…
Statement ‘UPDATE masterdata SET id = new.id, md_data = new.md_data FROM md_staging new WHERE masterdata.id = …’
successfully executed in 134 ms 456 µs (server processing time: 97 ms 711 µs) – Rows Affected: 3
Statement ‘INSERT INTO masterdata (SELECT id, md_data FROM md_staging WHERE id NOT IN (SELECT id FROM …’
successfully executed in 97 ms 91 µs (server processing time: 58 ms 243 µs) – Rows Affected: 2
Checking the numbers for the affected rows we see that 3 existing records have been UPDATED, although only one of them had actually been changed and 2 records have been INSERTed.
Looks OK to me, I’d say (for now)…
Next, let’s check the table storage again:
Compare that with what we had before:
No surprise for table MASTERDATA, but look what happened on the TRANSACTIONS table!
SHOCK, AWE and WONDER!
There are four records in the delta store now, although we didn’t actually changed any referenced data.
Checking on the column statistics for table TRANSACTIONS we find this:
Now there are 8 ❗ entries for every column, although we only have 4 distinct ID values and, as we know, only 4 records in total.
What is going on here?
This actually is the combined effect of two features in SAP HANA.
- UPDATEs are stored row-wide in the delta store and performed regardless if any data was actually changed.
Whenever we issue an UPDATE command, SAP HANA has to identify/find the record(s) to be updated first.
Once this is done, the whole record is copied, all SET-parts of the UPDATE command are applied to the copied record and the record is stored in the delta store. Finally the old record gets marked as invalid and the new record becomes the new valid record.
This is commonly called insert only database storage.
For our blog what’s interesting is that SAP HANA does not check whether anything actually changes.
Even if the SET-part of the UPDATE command sets the exact same values this change gets executed and stored in the delta store (and of course also in the redo log).
- The UPDATE action for the referential constraint is set to CASCADE.
So every update on a referenced column ❗ will lead to an update on the referencing table as well.
So far we’ve learned that performing a full update on the MASTERDATA table could lead to a lot more records to be touched then what we would intuitively think.
Now you should be asking: “What could be done to prevent this”?
There’s a couple of options:
a) Go without foreign key constraints for your data warehouse.
That’s what most DW vendors do, since FKs really tend to complicate things with data loading once more than a few tables use the same master data.
E.g. SAP BW does it that way.
b) Drop and recreate the foreign key constraints before/after data loading.
SAP HANA does currently not allow to disable FK constraints or to re-validate them.
This however is a nonsense option as exactly during the time of data modification – the time when you want the constraint to be active – it would just not be there.
c) You ensure that the referring column(s) – ID in our example – does not get updated.
This is actually not too difficult to achieve.
A small change in the UPDATE command we used above already would do the trick:
UPDATE masterdata SET md_data = new.md_data
where masterdata.id = new.id;
The downside is that the otherwise practical UPSERT statement won’t work here since it needs to have the values for ALL columns in any case.
That’s it again.
I bet you didn’t expect this or did you?