Used resources:
Required system components and tools:
Required system privileges:
Contents:
(1) Purpose/objective
The data mask node is a purpose-built function of SDI/SDQ to anonymize sensitive data, e.g. personal user data, credit card or payment data, in a simple and fast way. It comes with built-in algorithms that let you specify the level of anonymity and the randomization of masking values. It is easy to understand and apply; nevertheless, it can be helpful to refer to what somebody has already tested.
After reading this blog you should be able to handle the basics of the data mask node of HANA SDI/SDQ. It can be used as a reference for how certain rule types of the data mask node behave and how they are applied. This blog picks up explanations from the HANA EIM SPS12 Configuration Guide and illustrates them in a more comprehensive and graphical way.
I also created this post as there was no specific HANA Academy Video dealing with the data mask node.
(2) Data mask node & introducing steps
Preliminary steps:
Create table statement & insert statements for some sample data:
CREATE COLUMN TABLE "<SCHEMA>"."SampleCustomers" (
    "ID" TINYINT CS_INT,
    "NAME1" VARCHAR(2048),
    "STREET1" VARCHAR(2048),
    "HOUSNO" TINYINT CS_INT,
    "POSTC" SMALLINT CS_INT,
    "CNTRY" VARCHAR(2048),
    "LATID" VARCHAR(2048),
    "LONGI" VARCHAR(2048),
    "SURR_ID" NVARCHAR(1) NOT NULL,
    "EMAIL" NVARCHAR(200),
    "DATE" DATE CS_DAYDATE,
    PRIMARY KEY ("SURR_ID")
);
insert into "<SCHEMA>"."SampleCustomers" values(1,'Jack','Rue du 31 Décembre',68,1207,'CH','?','?','1','sample@sampleweb.de',TO_DATE('14.04.2014','DD.MM.YYYY'));
insert into "<SCHEMA>"."SampleCustomers" values(2,'Pete','Chemin du Château-Sec',2,1009,'CH','?','?','2','testsdi@web.de',TO_DATE('14.02.2013','DD.MM.YYYY'));
insert into "<SCHEMA>"."SampleCustomers" values(3,'Norman','Mythenquai',5,8002,'CH','?','?','3','hottopics@newsflash.de',TO_DATE('14.01.2012','DD.MM.YYYY'));
insert into "<SCHEMA>"."SampleCustomers" values(4,'Steve','Tösstalstrasse',76,8400,'CH','?','?','4','mcfly@combo.box',TO_DATE('14.07.2016','DD.MM.YYYY'));
insert into "<SCHEMA>"."SampleCustomers" values(5,'Josef','Erlenstrasse',1,8245,'CH','?','?','5','see_you@star.box',TO_DATE('14.01.2011','DD.MM.YYYY'));
After creating a new flowgraph, go to your canvas and begin...
First, I added a filter node to my canvas. This filter node serves to add additional columns:
These columns are mapped 1:1 to the corresponding source columns (NAME1, STREET1, EMAIL, HOUSNO). They are later used to apply the masking/variance rules, so that we can compare the masking/variance results with the original values from our source table.
In the filter node (1) add new columns to output and (2) map them by drag&drop operation to their corresponding source columns.
Drag&drop a Data Mask node onto the canvas and connect the filter node:
Double-click the Data Mask node to specify your masking settings. Before saving, make sure that you have defined an output node, e.g. a template table. In this example we won't always load data into the target but rather use the JIT preview option, which can be switched on from within the Data Mask node.
In general, consider the following data types that are supported from within the Data Mask node (a column of type CHAR won't work!):
For each column in the Data Mask node, there are three different options (you find them in the node details of your Data Mask node at the bottom of the page):
(3) Pattern variance and mask rule
In a first step, we are going to mask everything between characters 7 and 88 with a "$" sign (for the NAME1_MASK column). Everything before character 7 and after character 88 remains unchanged.
Select the mask rule for column NAME1 or any given character string based field that matches one of the data types shown above:
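To make the behavior tangible, here is a small simulation of what such a mask rule does to a value. This is my own sketch in Python, not the SDI implementation; the function name and parameters are invented for illustration:

```python
def mask_rule(value, start, end, mask_char="$"):
    """Replace the characters from position `start` to `end` (1-based,
    inclusive) with `mask_char`; characters outside the range are kept."""
    return "".join(
        mask_char if start <= i <= end else c
        for i, c in enumerate(value, start=1)
    )

# Masking characters 7-88, as configured above; shorter values
# are simply untouched beyond their own length:
print(mask_rule("Norman Smith", 7, 88))  # -> 'Norman$$$$$$'
print(mask_rule("Jack", 7, 88))          # -> 'Jack'
```

Note that a value shorter than the start position (like "Jack") passes through unchanged, which matches the idea of masking only a defined character range.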
Having applied the mask rule for field NAME1_MASK we move ahead and apply a pattern variance rule for column STREET1_MASK.
The pattern variance option is available for the following data types:
With the pattern variance option you can specify your masking settings for each defined set of characters (by specifying its range with the slider object). You can choose between:
No masking, preserves the source characters
Mask source characters with a defined mask character
Mask source characters with a defined mask string
Mask source characters with a random alphanumeric/numeric value
You can use as many of the pattern variance types shown above as you want. Just click the "+" sign shown in the image below and specify the settings, masking values and thresholds.
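The four pattern variance types can be sketched as follows. Again this is a simulation of the behavior described above (the function, the range tuples and the type names are my own, not SDI API names):

```python
import random
import string

def pattern_variance(value, ranges):
    """ranges: list of (start, end, vtype, arg) with 1-based inclusive
    positions covering the value. vtype is one of:
    'preserve' (keep source), 'char' (fixed mask character),
    'string' (fixed mask string), 'random' (random alphanumeric)."""
    pieces = []
    for start, end, vtype, arg in ranges:
        chunk = value[start - 1:end]
        if vtype == "preserve":
            pieces.append(chunk)                      # keep source characters
        elif vtype == "char":
            pieces.append(arg * len(chunk))           # e.g. '#######'
        elif vtype == "string":
            pieces.append(arg)                        # fixed replacement string
        elif vtype == "random":
            pieces.append("".join(
                random.choice(string.ascii_letters + string.digits)
                for _ in chunk))
    return "".join(pieces)

# Keep the first 3 characters, mask the rest with '#':
print(pattern_variance("Mythenquai", [(1, 3, "preserve", None),
                                      (4, 10, "char", "#")]))  # -> 'Myt#######'
```

In the flowgraph you define the same thing graphically: one slider range per variance type, each with its own masking value.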
Applying the pattern variance option in our test scenario:
As I enabled "JIT Preview" in my Data Mask node, I can now preview what the output would look like:
Column “NAME1”/”STREET1” – original values, column “NAME1_MASK”/“STREET1_MASK” – masked values:
We can now play around with the masking options and see how the output changes:
(a) Changing from masking the first 3 characters to the last 3 characters of the NAME1_MASK field, and using the String variance type for the STREET1_MASK field.
(b) Using the Preserve and Character variance type:
Column “NAME1”/”STREET1” – original values, column “NAME1_MASK”/“STREET1_MASK” – masked values:
(c) Using another String mask for field STREET1_MASK:
Column “NAME1”/”STREET1” – original values, column “NAME1_MASK”/“STREET1_MASK” – masked values:
(4) Numeric Variance rule
The numeric variance rule is applicable for the following data types:
The following numeric variance types can be applied for numeric data types:
The variance is not an absolute number (cf. "Fixed Number") but is calculated as a percentage of the numeric value that is passed in. So different min and max values will occur for each database record.
The fixed number numeric variance type will result in a value that is within the defined thresholds of min & max and corresponds to the defined variance.
A range will result in randomized values that are within the range, not being bound to any variance values.
In the following we will apply the "Fixed Number" as well as the "Range" numeric variance type.
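Before looking at the screenshots, the three numeric variance behaviors described above can be summarized in a few lines of Python. This is my own sketch of the semantics, not the SDI implementation, and the names are invented:

```python
import random

def numeric_variance(value, vtype, variance=None, lo=None, hi=None):
    """Simulates the numeric variance types:
    - 'percentage': value +/- (variance percent of value), clamped to [lo, hi]
    - 'fixed':      value +/- variance (absolute), clamped to [lo, hi]
    - 'range':      a random value within [lo, hi]; variance is ignored
    """
    if vtype == "range":
        return random.uniform(lo, hi)
    delta = value * variance / 100 if vtype == "percentage" else variance
    result = random.uniform(value - delta, value + delta)
    if lo is not None:
        result = max(lo, result)
    if hi is not None:
        result = min(hi, result)
    return result

# Fixed number: ID 5 with variance 2 always lands within [3, 7]
print(numeric_variance(5, "fixed", variance=2))
```

The key distinction: "Fixed Number" and the percentage variant vary around the source value, while "Range" produces values bound only to the min/max thresholds.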
(a) Numeric variance: "Range"
Column “ID” – original values, column “ID_MASK” – altered values:
(b) Numeric variance: "Fixed Number"
Column “ID” – original values, column “ID_MASK” – altered values:
(5) Email Masking (with Mask rule)
We will mask all characters of the email field while maintaining the email format:
Column “EMAIL” – original values, column “EMAIL_MASK” – masked values:
Another option you can make use of is to keep either the first or last x characters of your email field, with the email format still maintained:
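The format-preserving behavior can be sketched like this: the "@" and the dots survive, everything else is masked, and optionally the first/last characters of the local part are kept. This is a simulation under my own assumptions (in particular, that keep-first/keep-last applies to the part before the "@"), not the SDI code:

```python
def mask_email(email, mask_char="$", keep_first=0, keep_last=0):
    """Mask an e-mail address while keeping its format: '@' and '.'
    stay in place; optionally preserve the first/last characters
    of the local part."""
    local, _, domain = email.partition("@")

    def mask(part, first=0, last=0):
        return "".join(
            c if c == "." or i < first or i >= len(part) - last else mask_char
            for i, c in enumerate(part)
        )

    return mask(local, keep_first, keep_last) + "@" + mask(domain)

print(mask_email("sample@sampleweb.de"))                # -> '$$$$$$@$$$$$$$$$.$$'
print(mask_email("sample@sampleweb.de", keep_first=2))  # -> 'sa$$$$@$$$$$$$$$.$$'
```

Because the separators are preserved, a masked value still validates as an e-mail address, which is exactly why this rule is useful for test data.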
(6) Date Variance rule
The date variance rule allows you to vary dates by specifying minimum/maximum thresholds and values for how e.g. days, months or years are supposed to vary from the original date.
This is where you select the date variance rule for a source field:
The date variance rule is applicable for the following data types only:
The variance type for our date field will be “Fixed months” type. There are also some alternative date variance types to choose from:
Applying the date variance type "Fixed Months" in our example: all randomized dates will lie within the threshold values of "Jan 1, 2010" and "Jan 1, 2020". The variance parameter determines the maximum number of months (or days/years) that is subtracted from or added to the source value.
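The "Fixed Months" behavior described above can be simulated as follows. This is a sketch under my own assumptions (random shift of up to +/- variance months, result clamped to the thresholds; the day clamp to 28 is a simplification to keep every shifted date valid), not the SDI implementation:

```python
import random
from datetime import date

def date_variance_fixed_months(value, variance_months, lo, hi):
    """Shift `value` by a random number of whole months within
    +/- variance_months, then clamp the result to [lo, hi]."""
    shift = random.randint(-variance_months, variance_months)
    month_index = value.year * 12 + (value.month - 1) + shift
    year, month = divmod(month_index, 12)
    day = min(value.day, 28)  # crude clamp so e.g. Jan 31 -> Feb stays valid
    shifted = date(year, month + 1, day)
    return max(lo, min(hi, shifted))

d = date_variance_fixed_months(date(2014, 4, 14), 6,
                               date(2010, 1, 1), date(2020, 1, 1))
print(d)  # somewhere between Oct 2013 and Oct 2014
```

With a variance of 6 months, the source date 14.04.2014 can end up anywhere between October 2013 and October 2014, always inside the configured thresholds.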
(7) Data source - Reconciliation option (since SDI/SDQ SPS12)
While modeling the flowgraph I also used the new reconciliation option that came with HANA SPS12.
This option enables you to reconcile the data sources used in any flowgraph, keeping them up to date if alterations happened (new column, removed column, changed field length, ...).
I hope this blog entry helps you deal with the data mask node.
Let me know if something is misleading or if you don’t understand any part. I’m open to constructive criticism or fruitful discussions 😉
Best regards
Stefan