How to avoid NULL (?) values when using dimension attributes with fact table columns that contain blanks ( '' )
Purpose: Describe a method to address a common DW/BI problem: a fact row whose join column is blank ( '' ) has no matching row in the dimension, so queries return null attribute values for that entry. In general, we want to avoid returning null attribute values for a given entry in a fact. As a side note, this issue is not specific to HANA and needs to be addressed in whatever database your solution is implemented in; this post simply describes an easy way to solve it using HANA-based components.
Interesting to note: in SAP BW, master data tables/InfoObjects always contain a row with blank key values. If I had to guess, this is to ensure this behavior does not occur!
Real World Problem: While querying a given model in HANA, a user observes $300 in total sales by “attribute 1”, with $100 of that falling into a null (“?”) “attribute 1” value. When the user then applies a filter on “attribute 1”, the null value is dropped and the $100 “disappears”, which can cause some heartburn for the average user: “where did my sales go?!” The model is built with a fact and a dimension, joined with a left outer join in a 1:n relationship.
Shouldn’t a left outer join never drop data from the fact? The answer is no: even with a left outer join, once you apply a filter on the right table, an inner join is effectively executed. If the fact table has blank ( '' ) values in the join column and there is no matching blank ( '' ) value in the dimension, the inner join drops the key figure value from the result.
Generally, any ECC tables replicated with SLT have non-nullable columns with default values of 0 (for decimals) or '' (for NVARCHARs); therefore, if there is no value in a given column, it is always stored internally as blank ( '' ) whitespace.
Observing the “problem”
CREATE TABLE "MOLJUS02"."TEST_WHITESPACE_FACT" ("MATNR" NVARCHAR(18), "SALES" DECIMAL(15,2));
CREATE TABLE "MOLJUS02"."TEST_WHITESPACE_DIM" ("MATNR" NVARCHAR(18), "ATTR1" NVARCHAR(18));

INSERT INTO "MOLJUS02"."TEST_WHITESPACE_FACT" VALUES ('1234567', 100);
INSERT INTO "MOLJUS02"."TEST_WHITESPACE_FACT" VALUES ('1238890', 100);
INSERT INTO "MOLJUS02"."TEST_WHITESPACE_FACT" VALUES ('', 100);
INSERT INTO "MOLJUS02"."TEST_WHITESPACE_DIM" VALUES ('1234567', 'BLUE');
INSERT INTO "MOLJUS02"."TEST_WHITESPACE_DIM" VALUES ('1238890', 'RED');

--Simulate a dim/fact query
SELECT B."ATTR1", SUM(A."SALES")
FROM "MOLJUS02"."TEST_WHITESPACE_FACT" A
LEFT OUTER JOIN "MOLJUS02"."TEST_WHITESPACE_DIM" B ON (A."MATNR" = B."MATNR")
GROUP BY B."ATTR1";

--Simulate a dim/fact query with a filter on the dimension attribute
SELECT B."ATTR1", SUM(A."SALES")
FROM "MOLJUS02"."TEST_WHITESPACE_FACT" A
LEFT OUTER JOIN "MOLJUS02"."TEST_WHITESPACE_DIM" B ON (A."MATNR" = B."MATNR")
WHERE B."ATTR1" <> 'BLUE'
GROUP BY B."ATTR1";
Result from first query
Result from second query – Where did my $100 go?! If I excluded Blue materials, I should still have $200 remaining!
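The same behavior can be reproduced outside HANA. Below is a minimal sketch using Python's built-in sqlite3 (table names are shortened stand-ins for the HANA tables above); it mirrors the two queries and shows the $100 vanishing once the NE filter is applied:

```python
import sqlite3

# In-memory stand-in for the fact/dimension tables above
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact (MATNR TEXT, SALES REAL);
    CREATE TABLE dim  (MATNR TEXT, ATTR1 TEXT);
    INSERT INTO fact VALUES ('1234567', 100), ('1238890', 100), ('', 100);
    INSERT INTO dim  VALUES ('1234567', 'BLUE'), ('1238890', 'RED');
""")

# First query: the blank MATNR finds no dimension row, so ATTR1 is NULL
q1 = con.execute("""
    SELECT d.ATTR1, SUM(f.SALES)
    FROM fact f LEFT OUTER JOIN dim d ON f.MATNR = d.MATNR
    GROUP BY d.ATTR1
""").fetchall()
print(q1)  # three groups totaling 300, one of them (None, 100.0)

# Second query: the filter on the right table effectively turns the join
# into an inner join, and the NULL group is dropped along with its $100
q2 = con.execute("""
    SELECT d.ATTR1, SUM(f.SALES)
    FROM fact f LEFT OUTER JOIN dim d ON f.MATNR = d.MATNR
    WHERE d.ATTR1 <> 'BLUE'
    GROUP BY d.ATTR1
""").fetchall()
print(q2)  # only ('RED', 100.0) remains
```

The totals behave just as in the HANA screenshots: $300 before the filter, but only $100 after it, even though excluding BLUE should have left $200.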
Real World Solution: It’s actually quite simple – you always want a matching record in your dimension for any value that may exist in the fact. In some applications, including SAP ERP, you can have transactions with no value for a given column. In my example above, the real scenario occurred in CO-PA, where transactions with sales values had no MATNR (Material) assigned, only a blank ( '' ), so this can certainly happen in a production environment.
To solve the above, we simply need a record in the dimension with a blank ( '' ) value in the key(s). This avoids any chance of null values occurring when using an attribute.
INSERT INTO "MOLJUS02"."TEST_WHITESPACE_DIM" VALUES ('', '');
Now, let’s run the same queries again and observe the difference.
A blank ( '' ) instead of null is shown
No longer dropping any sales since we now have a matching dimension, my data is back!
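The fix can be verified with the same sqlite3 sketch (again, illustrative stand-in names, not the HANA tables themselves): once the blank member exists in the dimension, the NE filter behaves as expected.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact (MATNR TEXT, SALES REAL);
    CREATE TABLE dim  (MATNR TEXT, ATTR1 TEXT);
    INSERT INTO fact VALUES ('1234567', 100), ('1238890', 100), ('', 100);
    INSERT INTO dim  VALUES ('1234567', 'BLUE'), ('1238890', 'RED');
    -- the fix: a matching blank member in the dimension
    INSERT INTO dim  VALUES ('', '');
""")

# The NE filter no longer drops the blank-key sales: '' <> 'BLUE' is TRUE
q2 = con.execute("""
    SELECT d.ATTR1, SUM(f.SALES)
    FROM fact f LEFT OUTER JOIN dim d ON f.MATNR = d.MATNR
    WHERE d.ATTR1 <> 'BLUE'
    GROUP BY d.ATTR1
""").fetchall()
print(sorted(q2))  # [('', 100.0), ('RED', 100.0)] -- the full $200 is back
```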
Now, the above is all good and fine – you’re thinking the solution is straightforward: “I can just insert a record into the required dimension tables in each environment, right?” My answer would be: that’s no fun – why do this manually when you can have HANA perform it itself? Let’s create a stored procedure that does the work for us! 🙂
We need to create a stored procedure that inserts a whitespace row into every qualifying table in a target schema. In my case, this was an SLT-managed schema, so some elevated rights are needed. To insert into an SLT-managed schema, you either need to have the role <SCHEMA_NAME>_POWER_USER (which is created when an SLT connection is made) and create the procedure with “Invoker’s Rights”, or assign the same role to user _SYS_REPO and choose “Definer’s Rights” for the procedure.
– The procedure has an input parameter to choose which target schema to insert whitespace records into
– Read all tables of the target schema that are not SLT control (RS*) or data dictionary (DD*) tables
– Omit any tables that have at least one nullable column, or any non-nullable column without a default value
– Only look at tables where the column in position 1 is a primary key
– Build a worklist of table/key pairs to loop over
– Build and execute a dynamic SQL statement to insert a whitespace record into each remaining table in the schema
Below is the source from the _SYS_BIC representation, to show the parameter.
create procedure "_SYS_BIC"."sandbox.justin.insert_whitespace/INSERT_WHITESPACE_2"
  ( in process_schema NVARCHAR(12) )
  language SQLSCRIPT
  sql security invoker
  default schema "DE2_DH2" as
/********* Begin Procedure Script ************/
-- scalar variables
i INTEGER;
row_count INTEGER;
statement NVARCHAR(256);
BEGIN
--Work list
table_list =
  --List of tables in the target schema that match criteria.
  --Perform a MINUS to remove any tables that have at least one
  --nullable column, or a non-nullable column with no default value
  SELECT DISTINCT "TABLE_NAME"
  FROM "SYS"."TABLE_COLUMNS"
  WHERE "SCHEMA_NAME" = :process_schema
    AND "IS_NULLABLE" <> 'TRUE'
    AND "DEFAULT_VALUE" IS NOT NULL
    AND "TABLE_NAME" NOT LIKE 'RS%'
    AND "TABLE_NAME" NOT LIKE 'DD%'
  MINUS
  --List of tables that have at least one nullable column, or one
  --non-nullable column that has no default value
  SELECT DISTINCT "TABLE_NAME"
  FROM "SYS"."TABLE_COLUMNS"
  WHERE "SCHEMA_NAME" = :process_schema
    AND ("IS_NULLABLE" = 'TRUE'
         OR ("IS_NULLABLE" = 'FALSE' AND "DEFAULT_VALUE" IS NULL))
    AND "TABLE_NAME" NOT LIKE 'RS%'
    AND "TABLE_NAME" NOT LIKE 'DD%';

--Use the previous table list and find the column that is a primary key
--and is in position 1 for each table, in order to build the insert statement
table_columns =
  SELECT A."TABLE_NAME", B."COLUMN_NAME"
  FROM :table_list A
  INNER JOIN "SYS"."INDEX_COLUMNS" B ON (A."TABLE_NAME" = B."TABLE_NAME")
  WHERE B."POSITION" = '1'
    AND B."CONSTRAINT" = 'PRIMARY KEY';

--Store the dataset size in the row_count variable
SELECT COUNT(*) INTO row_count FROM :table_columns;

--Loop over the result set, building and executing an INSERT statement
--for each required table
FOR i IN 0 .. :row_count-1 DO
  SELECT 'INSERT INTO' || ' "' || :process_schema || '"."' || "TABLE_NAME" || '" '
         || '("' || "COLUMN_NAME" || '") VALUES ('''')'
  INTO statement
  FROM :table_columns
  LIMIT 1 OFFSET :i;

  EXEC(:statement);
END FOR;
END;
/********* End Procedure Script ************/
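The heart of the loop is plain string assembly. As an illustration (a Python sketch, not HANA code), here is the statement the loop builds and executes for one worklist entry; the function name is hypothetical:

```python
def build_whitespace_insert(schema: str, table: str, column: str) -> str:
    """Assemble the dynamic INSERT the loop executes for one worklist row.

    Mirrors the SQLScript concatenation above: quoted identifiers and a
    single blank ('') value for the position-1 primary key column.
    """
    return f'INSERT INTO "{schema}"."{table}" ("{column}") VALUES (\'\')'

# The statement generated for the demo dimension table from earlier
stmt = build_whitespace_insert("MOLJUS02", "TEST_WHITESPACE_DIM", "MATNR")
print(stmt)  # INSERT INTO "MOLJUS02"."TEST_WHITESPACE_DIM" ("MATNR") VALUES ('')
```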
Nice exercise in dynamic data manipulation in 'foreign' schemas and a good write up, too.
Thanks for that!
Just this comment:
Although it's tempting to solve the missing data issue en bloc, typically it's not advisable to do that.
If there are NULL values in ECC tables, then this is a problem, since ABAP does not support NULL values.
So, taking a step back here to find the cause of the NULL values (and eventually replace them with proper DEFAULT values) would be advised.
This can also happen during SLT replication.
Once the data is free of NULLs, you could actually go a step further and replace the outer join with an inner join (speeding up the join processing).
Hmm... here we go again - fighting the bad NULLs again 🙂
Thanks again for this piece!
Hey Lars, thanks for the feedback. Maybe it was not so clear what I was trying to solve here.
The issue was not that NULL values were being delivered from ECC, in fact in my real examples I was getting proper default ( '' ) values in the fact tables per their definition. So the source data was coming correctly in all regards, I am not simply trying to fix bad data in the DW (we all know it should always be fixed in the source 🙂 ).
The problem comes when I don't have a matching blank ( '' ) value in the dimension to which the fact is joined, when the fact has a blank ( '' ) in the joined column. In business terms, it is quite possible to have a CO-PA document that does not have a material on it, so when it arrives in HANA the MATNR column on the fact will default to blank ( '' ). This is all fine and good if I am just using an attribute as *display* in my result (see first SQL example), but as soon as I start applying filters to that dimension, that's when the results get confusing (see second SQL example - specifically it would only be an issue with a WHERE clause containing a NE <> operator).
Take another look at the SQL example in the setup above before the fix was applied, and I think it will become more clear.
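The confusing part is SQL's three-valued logic: NULL <> 'BLUE' evaluates to NULL (unknown), not TRUE, and a WHERE clause keeps only rows that evaluate to TRUE. A quick check with sqlite3 (illustrative, the rule is standard SQL and not HANA-specific):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# NULL <> 'BLUE' is NULL (unknown), so a WHERE clause would drop the row
print(con.execute("SELECT NULL <> 'BLUE'").fetchone()[0])  # None

# A blank string compares as an ordinary value: '' <> 'BLUE' is TRUE,
# so a row with a blank attribute survives the NE filter
print(con.execute("SELECT '' <> 'BLUE'").fetchone()[0])    # 1
```

This is why the fix works: replacing the unmatched NULL with a matched blank ( '' ) turns an "unknown" comparison into an ordinary TRUE.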
On another note – you mention inner joins speeding up processing vs. a left outer join. I actually found the opposite to be true when only the joined key column is used. However, I don't have any real evidence concerning the non-joined attribute columns. Do you have anything to share in this regard? Always looking for an edge in processing!
Ah, so this was actually all about what we call the INITIAL record in SAP BW? 🙂
You're right, that's what we always generate into the dimension tables to avoid the NULLs that are generated by non-matching records during the outer join.
Concerning the possible optimization for the outer join you mentioned in your other blog post: correct, as long as the join cardinalities in the analytic view are set up correctly (n:1, that is), the OLAP engine can figure out that the result set won't be changed by performing the join and skip it when you don't actually want to see data from the joined table.
That's pretty much the same as what the referential join does for an inner join operation.
So, the outer join can be quicker than an inner join when it's not actually being executed.
Set the cardinality wrong or force the join otherwise to be executed and you should see a different picture.
Well, I wouldn't say it was "all about" BW per se, but it was an obvious correlation after thinking about it for some time; it's the same solution to the same problem.
Not to beat the join topic to death, but the analysis in that link shows that in the scenario that only the joined column is requested from the right table, the LOJ does not perform a join, but the Referential does. So in that regard, it would mean to me that the LOJ is cheaper since it can be smarter about when the join is really required.
In Referential, if any column from the right table is selected, it will join. In the LOJ, if any column other than the joined column is selected, it will join.
Oh, and I should point out, too, that in BW master data there is always a row with a blank ( '' ) value present to avoid such situations. I'll update the original post with that info too.
I came up with the exact same set of questions recently, and you covered that very well above!
One of your points confused me quite a bit at first, I believe it should read:
"- Omit any tables that have at least one nullable column, or if it has any non-nullable column without a default value also omit it."
As you are not supplying any MANDT, your inserts will all end up being MANDT='000'.
Am I correct that while that is fine for cross-client reporting, if one would want to report by session client the insert needs to specify the wanted client id? (For tables containing a MANDT column that is.)
For certain tables, your script does not take into account the logic of multi-column PKs.
For example for Regions in T005U, one would need to add a record with a blank BLAND for every existing Country LAND1 (and what if an additional country later gets entered into T005U?)
Similar for Storage Locations (per Plant), Profit Centers (per Controlling Area), etc.
Spot-checking a few usual suspects in the ECC system I'm interested in, I see MARA gets sorted out by your script because a nullable column VARID exists. I assume those would need to be taken care of manually?
Now the million-dollar question:
Would you / have you run your script against a Production HANA ECC database?
In other words are no negative side effects to be observed, caused by those directly inserted blank records such as a blank TVKO (Sales Organization)?
Hi Michael, thanks for the feedback, it's been quite some time since I have thought about this.
The client did not in fact have a need for cross client reporting, in general I don't see that happening.
You have a very valid point on multi column PK, you could easily adjust the logic to accommodate that. It would be required if you wanted to let this loose on an entire schema.
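As an illustration of what such an adjustment could look like (a sqlite3 sketch with T005U-style but purely illustrative names): for a two-column key, insert one row with a blank second key per distinct value of the first key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Illustrative stand-in for a T005U-style table: country + region key
    CREATE TABLE regions (LAND1 TEXT, BLAND TEXT, BEZEI TEXT,
                          PRIMARY KEY (LAND1, BLAND));
    INSERT INTO regions VALUES ('US', 'NY', 'New York'), ('DE', 'BY', 'Bavaria');
""")

# One blank-BLAND row per existing country; re-runnable thanks to the
# NOT EXISTS guard, so countries added later get covered on the next run
con.execute("""
    INSERT INTO regions (LAND1, BLAND, BEZEI)
    SELECT DISTINCT r.LAND1, '', ''
    FROM regions r
    WHERE NOT EXISTS (SELECT 1 FROM regions x
                      WHERE x.LAND1 = r.LAND1 AND x.BLAND = '')
""")

rows = sorted(con.execute("SELECT LAND1, BLAND FROM regions").fetchall())
print(rows)  # includes ('DE', '') and ('US', '') alongside the real regions
```

Scheduling such a statement (or running it after replication) would address the "country added later" concern, though as noted the real worklist logic would need to discover the key columns dynamically.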
Now, as to your last question – the customer did use this in a production sidecar instance with slightly modified logic, i.e. some specific table lists. I wouldn't suggest doing this in a SoH instance, nor would I know whether there would be adverse side effects, so take that with a grain of salt. This was just an exercise that I thought would be interesting to share.
This blog helped me. Thank you.
I want to know why null values are getting filtered out when we filter for BLUE.
Any related knowledge on NULL values would be helpful.
Happy HANA :-p