HANA Arrays and Dynamic Filters in a Data Quality Framework
We recently concluded an engagement in which we used HANA to join over 80 source tables and provide a 70+ column data feed to an upstream system for further processing. For reasons of client confidentiality I am not permitted to go into the functional aspects of the project.
Our methodology was to use a series of calculation views over the source system tables to arrive at the final format.
The upstream process was sensitive to Data Quality issues, and records with errors had to be filtered out beforehand. Hence there was a need to run a Data Quality check on the feed just prior to sending it to the upstream system, with the following requirements:
The client had specific rules around the formats/values that are valid for each of the 70+ fields. The Data Quality process should check these rules for each and every field and:
a. Mark the record as ‘erroneous’ and not send it to the upstream system if any of the fields failed the Data Quality check.
b. Move the record identifier to a separate table along with information on which Data Quality rule was violated.
c. Mark subsequent related records as ‘dependent erroneous’ if they were related to the records found in step a. There was a particular field, say ‘Customer_Id’, which linked multiple records. So for Customer_Id X43Y, if the record coming in on 01/01/2017 had an error, then all subsequent records coming in on, say, 01/02/2017 for Customer_Id X43Y should be held back with a ‘dependent DQ error’ even if those records had no quality error in them.
d. Be reportable by creating Analytics with SAP BI.
We achieved this by designing a Data Quality framework and making use of Dynamic SQL and HANA arrays. Our solution comprised:
1. A Data Quality Rules table
2. A Status Codes table which held the error status codes:
6 – Data Quality Error
7 – Record hold due to prior Data Quality Error
3. An error output table which held the record identifier, the error field details and the linkage to the Data Quality Rules table
4. A procedure using Dynamic SQL to run the rules and produce the output.
Table Structures:
1. Data Quality Rules Table
CREATE COLUMN TABLE "SCHEMA_Q"."DATA_QUALITY_CHECK_RULES" (
"DQERR_ID" INTEGER CS_INT,
"TABLE_NAME" VARCHAR(100),
"COLUMN_NAME" VARCHAR(100),
"VALIDATION_RULE" NVARCHAR(1000),
"VALIDATION_RULE_DESC" VARCHAR(100),
"ERROR_CODE" VARCHAR(500),
"ACTIVE_KEY" VARCHAR(1)
) UNLOAD PRIORITY 5 AUTO MERGE;
Data Quality rules for each of the 70+ fields were inserted into the Rules table. Some of the checks could have been implemented using database constraints, but a constraint violation would abort our end-to-end process rather than simply filter out the offending records, and we wanted to avoid that.
Insert Sample:
insert into "SCHEMA_Q"."DATA_QUALITY_CHECK_RULES"
values(1,
'"SCHEMA_Q"."TABLE1"',
'T1_F7',
'T1_F7 IS NULL OR LENGTH(TRIM(T1_F7)) < 3',
'T1_F7 SHOULD NOT BE NULL AND THE MINIMUM LENGTH SHOULD BE 3',
'T1_F7 INVALID',
1);
2. Status Codes Table
Status Code, Description, Active_Flag
6, Data Quality Error, 1
7, Dependent DQ Error, 1
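A minimal DDL sketch for this table, assuming the three columns listed above (the table name DQ_STATUS_CODES and the column lengths are illustrative):
CREATE COLUMN TABLE "SCHEMA_Q"."DQ_STATUS_CODES" (
"STATUS_CODE" INTEGER,
"DESCRIPTION" VARCHAR(100),
"ACTIVE_FLAG" VARCHAR(1));
INSERT INTO "SCHEMA_Q"."DQ_STATUS_CODES" VALUES (6, 'Data Quality Error', '1');
INSERT INTO "SCHEMA_Q"."DQ_STATUS_CODES" VALUES (7, 'Dependent DQ Error', '1');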
3. TABLE1_DQERROR
DQERR_ID, <T1F1_KEY>, <T1F2_KEY>, <T1F3_KEY>, <T1F4_KEY>, ERROR_FIELD
(The key columns form the record identifier of the table on which you run the DQ check)
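A minimal DDL sketch for this error table, assuming the key columns used by the procedure below (the column types and lengths are illustrative):
CREATE COLUMN TABLE "SCHEMA_Q"."TABLE1_DQERROR" (
"DQERR_ID" INTEGER,
"T1F1_KEY" VARCHAR(50),
"T1F2_KEY" VARCHAR(50),
"T1F3_KEY" VARCHAR(50),
"T1F4_KEY" VARCHAR(50),
"ERROR_FIELD" VARCHAR(100));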
4. Procedure Code: the code is provided only for reference purposes; it has not been production tested and should be used at your own risk.
It is important to understand the disadvantages of HANA Dynamic SQL listed below and to assess them against your organization's security guidelines.
Dynamic SQL allows you to construct an SQL statement at the execution time of a procedure. While dynamic SQL provides more flexibility in creating SQL statements, it has the disadvantage of an additional cost at runtime:
- Opportunities for optimizations are limited.
- The statement is potentially recompiled every time the statement is executed.
- You cannot use SQLScript variables in the SQL statement.
- You cannot bind the result of a dynamic SQL statement to a SQLScript variable.
- You must be very careful to avoid SQL injection bugs that might harm the integrity or security of the database.
ARRAY_AGG: converts a column of a table variable into an array.
APPLY_FILTER: applies a dynamic filter condition to a table or table variable. Logically it can be considered a partial dynamic SQL statement. The advantage of the function is that you can assign its result to a table variable.
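A minimal sketch of how the two functions work together, written as an anonymous SQLScript block against the tables above; the full procedure below uses the same pattern:
DO BEGIN
DECLARE RULEARR NVARCHAR(1000) ARRAY;
-- Read the active rules for TABLE1 into a table variable
RULES = SELECT VALIDATION_RULE FROM "SCHEMA_Q"."DATA_QUALITY_CHECK_RULES"
WHERE ACTIVE_KEY = '1' AND TABLE_NAME = '"SCHEMA_Q"."TABLE1"';
-- ARRAY_AGG: turn the VALIDATION_RULE column of the table variable into an array
RULEARR = ARRAY_AGG(:RULES.VALIDATION_RULE);
-- APPLY_FILTER: apply the first rule as a dynamic WHERE condition; the result can be assigned to a table variable
SRC = SELECT * FROM "SCHEMA_Q"."TABLE1";
ERR_ROWS = APPLY_FILTER(:SRC, :RULEARR[1]);
SELECT COUNT(*) AS ERROR_ROWS FROM :ERR_ROWS;
END;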
PROCEDURE "SCHEMA_Q"."QPACKAGE.PROCEDURES::DQ_CHECKS" (IN PROC_DATE DATE )
LANGUAGE SQLSCRIPT
SQL SECURITY INVOKER
DEFAULT SCHEMA "SCHEMA_Q"
--READS SQL DATA
AS
BEGIN
/**************************************************************************************
# Procedure Name: DQ_CHECKS
# Created By: ME
# Program
# Date: 28/Jul/2016
# -------------------------------------------------------------------------------
# Parameters
# Input : Date
# Output - #NULL
# --------------------------------------------------------------------------------
This procedure does DQ checks
1- manages the status
2 - captures the field in error
3 - holds all subsequent related input records from going upstream
If your SQLScript procedure needs to execute dynamic SQL statements where parts of them are derived from untrusted input (e.g. a user interface), there is a danger of an SQL injection attack. In the case below there is no user input; the values are taken from a table. The date parameter will be passed to the procedure by Data Services, and manual run access is provided only to trusted personnel.
The code assumes that we are dealing with one output table (TABLE1); it is not made generic for the number of tables.
It is generic for the number of conditions for the table in consideration. For more tables the same code can be duplicated in the procedure for each table.
-- UNABLE TO USE DYNAMIC SQL HERE -- TODAYDAT = SELECT * FROM "SCHEMA_Q"."TABLE1" WHERE TO_DATE(ENT_TIME) = :PROC_DATE
-- You cannot bind the result of a dynamic SQL statement to a SQLScript variable
TO DO: Include reprocessing logic
-------------------------------------------------------------------
# Version Notes
#
# Date Version - Comments - Modified By
# -------------------------------------------------------------------
# 12/09/2016.
*************************************************************************/
DECLARE IM_FILTER_STRING NVARCHAR(200);
DECLARE VALRULEARR NVARCHAR(1000) ARRAY;
DECLARE ERRORCODEARR INTEGER ARRAY;
DECLARE TABNAMEARR VARCHAR(100) ARRAY;
DECLARE COLNAMEARR VARCHAR(100) ARRAY;
DECLARE TABLECNT INT;
DECLARE ARRIND INT;
DECLARE I INT;
DECLARE MAXIND_T2 INT;
DECLARE MAXIND_T1 INT;
--- Processing for TABLE1 Data Quality Error Begins
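-- Read today's unprocessed TABLE1 records and collect the active rules for TABLE1 into arrays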
TODAYDAT_T1 = SELECT * FROM "SCHEMA_Q"."TABLE1" WHERE TO_DATE(ENT_TIME) = :PROC_DATE AND ERR_STATUS = 0;
VALRULE_T1 = SELECT DQERR_ID,COLUMN_NAME,VALIDATION_RULE FROM "SCHEMA_Q"."DATA_QUALITY_CHECK_RULES" WHERE ACTIVE_KEY =1 AND TABLE_NAME ='"SCHEMA_Q"."TABLE1"' ORDER BY DQERR_ID;
VALRULEARR = ARRAY_AGG(:VALRULE_T1.VALIDATION_RULE);
ERRORCODEARR = ARRAY_AGG(:VALRULE_T1.DQERR_ID);
COLNAMEARR = ARRAY_AGG(:VALRULE_T1.COLUMN_NAME);
SELECT COUNT(DQERR_ID) INTO MAXIND_T1 FROM "SCHEMA_Q"."DATA_QUALITY_CHECK_RULES" WHERE ACTIVE_KEY = 1 AND TABLE_NAME = '"SCHEMA_Q"."TABLE1"';
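-- Loop over the active rules; APPLY_FILTER returns the rows of TODAYDAT_T1 that match the rule's error condition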
FOR ARRIND IN 1 .. :MAXIND_T1 DO
OUTPUT_TABLE_T1 = APPLY_FILTER(:TODAYDAT_T1,:VALRULEARR[:ARRIND]);
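-- Insert the offending rows with a placeholder DQERR_ID of 999; the actual rule id is set in the UPDATE that follows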
INSERT INTO "SCHEMA_Q"."TABLE1_DQERROR"
SELECT DISTINCT
999,"T1F1_KEY" AS CUSTOMER_ID,
"T1F2_KEY","T1F3_KEY",
"T1F4_KEY",:COLNAMEARR[:ARRIND]
FROM :OUTPUT_TABLE_T1;
UPDATE "SCHEMA_Q"."TABLE1_DQERROR" SET DQERR_ID =:ERRORCODEARR[:ARRIND] WHERE DQERR_ID = 999;
commit;
END FOR;
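-- Flag today's unprocessed records that appear in the error table as Data Quality errors (status 6)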
UPDATE
"SCHEMA_Q"."TABLE1" C
SET
ERR_STATUS = 6 -- dqerror
FROM
"SCHEMA_Q"."TABLE1" C
INNER JOIN
"SCHEMA_Q"."TABLE1_DQERROR" B
ON
B.T1F1_KEY = C.T1F1_KEY
AND TO_DATE(C.ENT_TIME) = :PROC_DATE
AND C.ERR_STATUS = 0;
--- Processing for TABLE 1 Data Quality Error Ends
---FINAL SYNC FOR CURRENT DATA BASED ON PAST DATA
ALL_DQ = SELECT DISTINCT "T1F1_KEY" AS CUSTOMER_ID FROM "SCHEMA_Q"."TABLE1" WHERE ERR_STATUS IN (6,7);
---code for updating dependent hold
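-- Illustrative sketch only (not production tested): hold back today's otherwise clean records whose
-- customer already has a DQ or dependent DQ error, assuming T1F1_KEY is the Customer_Id field that
-- links related records (the same selection as ALL_DQ above)
UPDATE "SCHEMA_Q"."TABLE1"
SET ERR_STATUS = 7 -- dependent DQ error
WHERE TO_DATE(ENT_TIME) = :PROC_DATE
AND ERR_STATUS = 0
AND T1F1_KEY IN (SELECT T1F1_KEY FROM "SCHEMA_Q"."TABLE1" WHERE ERR_STATUS IN (6,7));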
END;
Execution:
Execute the procedure with a normal call:
CALL "SCHEMA_Q"."QPACKAGE.PROCEDURES::DQ_CHECKS"('20161214');
The procedure will loop through the rules table and execute all the Active Rules.
Performance Considerations and Cons:
1. The Rules have to be manually maintained unless the solution is extended with a GUI.
2. Generally, Dynamic SQL carries a performance penalty, so performance needs should be weighed before choosing a Dynamic SQL implementation.
In our case, processing around 40,000 records and running a check on each of the 70+ fields took the procedure roughly 50-70 seconds under regular load on a single-node system, assuming an error in each record.
Processing 3M+ rows with multiple errors per row took the processing time to over 5 minutes.