
Used resources:


Required system components and tools:

  • HANA Platform SPS12
  • DU IM_DP (version corresponding to SPS12, see PAM)
  • Web-based development workbench in Chrome
  • Data source with demo data

Required system privileges (a consolidated SQL sketch follows this list):

  • GRANT SELECT, UPDATE, INSERT, DELETE, EXECUTE ON SCHEMA "<YOURSCHEMA>" TO _SYS_REPO WITH GRANT OPTION;
  • Object privileges (EXECUTE):
    • MULTI_TEXT_ACCESSOR
    • TEXT_ACCESSOR
  • SELECT on input/output SCHEMA of your flowgraph
  • For JIT-Preview on flowgraph nodes:
    • SELECT and EXECUTE with GRANT OPTION (granted to _SYS_REPO)
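
For reference, here is a consolidated sketch of the privilege statements above. The schema and grantee names are placeholders; adapt them to your landscape and verify the exact names and owning schema of the accessor procedures (MULTI_TEXT_ACCESSOR, TEXT_ACCESSOR) in your own installation:

-- Sketch only: the repository user _SYS_REPO needs access to your data schema
GRANT SELECT, UPDATE, INSERT, DELETE, EXECUTE ON SCHEMA "<YOURSCHEMA>" TO _SYS_REPO WITH GRANT OPTION;

-- Sketch only: EXECUTE on the accessor procedures (the owning schema may differ in your system)
GRANT EXECUTE ON "<PROCEDURE_SCHEMA>"."MULTI_TEXT_ACCESSOR" TO <YOUR_USER>;
GRANT EXECUTE ON "<PROCEDURE_SCHEMA>"."TEXT_ACCESSOR" TO <YOUR_USER>;

-- Sketch only: SELECT on the input/output schema of your flowgraph
GRANT SELECT ON SCHEMA "<YOURSCHEMA>" TO <YOUR_USER>;

-- Sketch only: for the JIT preview, SELECT and EXECUTE granted to _SYS_REPO with grant option
GRANT SELECT, EXECUTE ON SCHEMA "<YOURSCHEMA>" TO _SYS_REPO WITH GRANT OPTION;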

Contents:

  • (1) Purpose/objective
  • (2) Data mask node & introductory steps
  • (3) Pattern variance and mask rule
  • (4) Numeric variance rule
  • (5) Email masking (with mask rule)
  • (6) Date variance rule
  • (7) Data source: reconciliation option in SPS12

(1) Purpose/objective

The data mask node is an SDI/SDQ function for anonymizing sensitive data, e.g. personal user data, credit card or payment data, in a simple and fast way. It comes with built-in algorithms that let you specify the level of anonymity and the randomization of the masking values. It is easy to understand and apply; nevertheless, it can be helpful to refer to what somebody else has already tested.

After reading this blog you should be able to handle the basics of the data mask node of HANA SDI/SDQ. It can also serve as a reference for how certain rule types of the data mask node behave and how they are applied. The blog picks up explanations from the HANA EIM SPS12 Configuration Guide and illustrates them in a more comprehensive and graphical way.

I also created this post because there was no dedicated HANA Academy video dealing with the data mask node.

(2) Data mask node & introductory steps

Preliminary steps:

  • Make sure you have a data source where you can easily apply masking operations to specific columns. If you don’t have a source table, you can use the following mock data (you don't necessarily need all columns):

Create table statement & insert statements for some sample data:

CREATE COLUMN TABLE "<SCHEMA>"."SampleCustomers" (
    "ID"      TINYINT,
    "NAME1"   VARCHAR(2048),
    "STREET1" VARCHAR(2048),
    "HOUSENO" TINYINT,
    "POSTC"   SMALLINT,
    "CNTRY"   VARCHAR(2048),
    "LATID"   VARCHAR(2048),
    "LONGI"   VARCHAR(2048),
    "SURR_ID" NVARCHAR(1) NOT NULL,
    "EMAIL"   NVARCHAR(200),
    "DATE"    DATE,
    PRIMARY KEY ("SURR_ID")
);

INSERT INTO "<SCHEMA>"."SampleCustomers" VALUES (1, 'Jack',   'Rue du 31 Décembre',    68, 1207, 'CH', '?', '?', '1', 'sample@sampleweb.de',    '2014-04-14');
INSERT INTO "<SCHEMA>"."SampleCustomers" VALUES (2, 'Pete',   'Chemin du Château-Sec',  2, 1009, 'CH', '?', '?', '2', 'testsdi@web.de',         '2013-02-14');
INSERT INTO "<SCHEMA>"."SampleCustomers" VALUES (3, 'Norman', 'Mythenquai',             5, 8002, 'CH', '?', '?', '3', 'hottopics@newsflash.de', '2012-01-14');
INSERT INTO "<SCHEMA>"."SampleCustomers" VALUES (4, 'Steve',  'Tösstalstrasse',        76, 8400, 'CH', '?', '?', '4', 'mcfly@combo.box',        '2016-07-14');
INSERT INTO "<SCHEMA>"."SampleCustomers" VALUES (5, 'Josef',  'Erlenstrasse',           1, 8245, 'CH', '?', '?', '5', 'see_you@star.box',       '2011-01-14');



  • Create a new flowgraph in your Web-based development workbench
  • Drag a data source onto your canvas

After creating a new flowgraph, go to your canvas and begin...

The first thing I did was add a filter node to my canvas. This filter node is used to add additional columns:

  • NAME1_MASK
  • STREET1_MASK
  • EMAIL_MASK
  • HOUSENO_MASK


These columns are mapped 1:1 to the corresponding source columns NAME1, STREET1, EMAIL and HOUSENO. The masking/variance rules are applied to these *_MASK columns later on, so that the masked/varied results can be compared with the original values from the source table.



In the filter node, (1) add the new columns to the output and (2) map them via drag & drop to their corresponding source columns.

  • NAME1 → NAME1_MASK
  • STREET1 → STREET1_MASK
  • EMAIL → EMAIL_MASK
  • HOUSENO → HOUSENO_MASK
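
Conceptually, the filter node with these mappings behaves like the following SELECT on the sample table from above (a sketch for illustration only; the node itself is configured purely in the UI):

-- Sketch: the filter node output is the source plus duplicated *_MASK columns
SELECT
  "ID", "NAME1", "STREET1", "HOUSENO", "POSTC", "CNTRY",
  "LATID", "LONGI", "SURR_ID", "EMAIL", "DATE",
  "NAME1"   AS "NAME1_MASK",
  "STREET1" AS "STREET1_MASK",
  "EMAIL"   AS "EMAIL_MASK",
  "HOUSENO" AS "HOUSENO_MASK"
FROM "<SCHEMA>"."SampleCustomers";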

Drag & drop a Data Mask node onto the canvas and connect it to the filter node:


Double-click the Data Mask node to specify your masking settings. Before saving, make sure that you have defined an output node, e.g. a template table. In this example we won’t always load data into the target but rather use the JIT preview option, which can be switched on from within the Data Mask node.



In general, note the following data types that are supported by the Data Mask node (a column of type CHAR won't work!):


For each column in the Data Mask node, there are four different rule options (you find them in the node details of your Data Mask node at the bottom of the page):

  • Mask
  • Pattern Variance
  • Date Variance
  • Numeric Variance

(3) Pattern variance and mask rule


As a first step, we are going to mask everything between characters 7 and 88 with a “$” sign (for the NAME1_MASK column). Everything before character 7 and after character 88 remains untouched.
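
To get a feeling for the result before touching the flowgraph, you can approximate this rule in plain SQL (a sketch only; the actual masking is done by the Data Mask node, not by this statement):

-- Sketch: characters 1-6 preserved, everything from position 7 onwards masked with '$'
-- (for the short sample names this is effectively the same as masking positions 7-88)
SELECT
  "NAME1",
  RPAD(SUBSTRING("NAME1", 1, 6), LENGTH("NAME1"), '$') AS "NAME1_MASK_SQL"
FROM "<SCHEMA>"."SampleCustomers";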

Select the mask rule for column NAME1 or any given character string based field that matches one of the data types shown above:


Having applied the mask rule for field NAME1_MASK we move ahead and apply a pattern variance rule for column STREET1_MASK.

The pattern variance option is available for the following data types:

  • ALPHANUM
  • NVARCHAR
  • VARCHAR
  • SHORTTEXT


With the pattern variance option you can specify your masking settings for each defined set of characters (by specifying its range with the slider object). You can choose between:

  • PRESERVE

        No masking, preserves source characters

  • CHARACTER

        Mask source character with defined mask character

  • STRING

        Mask source character with defined mask string

  • DEFAULT

        Mask source character by random alphanum/numeric value


You can use as many of the pattern variance types shown above as you want. Just click the "+" sign that you see in the image below and specify the settings, masking values and thresholds.
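
As a quick plain-SQL analogue (a sketch with assumed parameters, not what the node executes internally), a pattern variance that preserves the first 3 characters and masks the remainder could be thought of as follows:

-- Sketch: PRESERVE characters 1-3, CHARACTER-mask the remainder with 'X'
SELECT
  "STREET1",
  RPAD(SUBSTRING("STREET1", 1, 3), LENGTH("STREET1"), 'X') AS "STREET1_CHAR_MASK",
  -- Sketch: PRESERVE characters 1-3, STRING-mask the remainder with a fixed string
  SUBSTRING("STREET1", 1, 3) || '-MASKED' AS "STREET1_STRING_MASK"
FROM "<SCHEMA>"."SampleCustomers";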



Applying the pattern variance option in our test scenario:


As I enabled "JIT-Preview" in my Data Mask node, I can now preview what the output will look like:

Column “NAME1”/”STREET1” – original values, column “NAME1_MASK”/“STREET1_MASK” – masked values:

We can now play around with the masking options and see what the output looks like:

(a) Changing from masking the first 3 characters to the last 3 characters for the NAME1_MASK field, and using the String variance type for the STREET1_MASK field:

(b) Using the Preserve and Character variance type:

Column “NAME1”/”STREET1” – original values, column “NAME1_MASK”/“STREET1_MASK” – masked values:

(c) Using another String mask for field STREET1_MASK:

Column “NAME1”/”STREET1” – original values, column “NAME1_MASK”/“STREET1_MASK” – masked values:

(4) Numeric Variance rule

The numeric variance rule is applicable for the following data types:

  • BIGINT
  • DOUBLE
  • INTEGER
  • REAL
  • SMALLDECIMAL
  • SMALLINT
  • TINYINT

The following numeric variance types can be applied for numeric data types:

  • Percentage

        The variance is not an absolute number (cf. “Fixed Number”) but is calculated as a percentage of the numeric value that is passed in. Consequently, different minimum and maximum bounds result for each database record.

  • Fixed Number

        The fixed number numeric variance type will result in a value that is within the defined thresholds of min & max and corresponds to the defined variance.

  • Range

        A range will result in randomized values that are within the range, not being bound to any variance values.

In the following we will apply the "Fixed Number" as well as the "Range" numeric variance type.

(a) Numeric variance: "Range"

Column “ID” – original values, column “ID_MASK” – altered values:


(b) Numeric variance: "Fixed Number"

Column “ID” – original values, column “ID_MASK” – altered values:
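
Since the configuration happens in the UI, here is only a rough plain-SQL analogue of a "Fixed Number" variance of ±3 bounded to the range 1–100 (a sketch with assumed parameter values, not the node's actual implementation):

-- Sketch: ID varied by a random offset in [-3, +3], clamped to [1, 100]
SELECT
  "ID",
  GREATEST(1, LEAST(100, "ID" + CAST(ROUND(RAND() * 6 - 3) AS INTEGER))) AS "ID_MASK_SQL"
FROM "<SCHEMA>"."SampleCustomers";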

(5) Email Masking (with Mask rule)


We will mask all characters of the email field while maintaining the email format:

Column “EMAIL” – original values, column “EMAIL_MASK” – masked values:
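
For comparison, here is a very rough plain-SQL approximation that masks the part before the "@" while keeping the rest of the address (a sketch only; the node's format-preserving masking handles this internally and may treat the domain and dots differently):

-- Sketch: replace the local part with '*' characters, keep '@' and the domain
SELECT
  "EMAIL",
  RPAD('', LOCATE("EMAIL", '@') - 1, '*') || SUBSTRING("EMAIL", LOCATE("EMAIL", '@')) AS "EMAIL_MASK_SQL"
FROM "<SCHEMA>"."SampleCustomers";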

Another option you can make use of is to keep either the last or the first x characters of your email field, again with the email format maintained:

(6) Date Variance rule


The date variance rule allows you to vary dates by specifying minimum/maximum thresholds and by defining how far e.g. days, months or years may deviate from the original date.


This is where you select the date variance rule for a source field:

The date variance rule is applicable for the following data types only:

  • DATE
  • SECONDDATE
  • TIMESTAMP

The variance type for our date field will be the “Fixed Months” type. There are also some alternative date variance types to choose from:

Applying the date variance type “Fixed Months” in our example: all randomized dates will lie within the threshold values of “Jan 1, 2010” and “Jan 1, 2020”. The variance parameter determines the maximum number of months (or days/years) that are subtracted from or added to the source value.
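
Again only as a rough plain-SQL analogue (a sketch assuming a variance of ±12 months): the node picks a random month offset and keeps the result inside the configured thresholds:

-- Sketch: vary the date by up to 12 months in either direction,
-- clamped to the thresholds Jan 1, 2010 and Jan 1, 2020
SELECT
  "DATE",
  GREATEST(TO_DATE('2010-01-01'),
           LEAST(TO_DATE('2020-01-01'),
                 ADD_MONTHS("DATE", CAST(ROUND(RAND() * 24 - 12) AS INTEGER)))) AS "DATE_MASK_SQL"
FROM "<SCHEMA>"."SampleCustomers";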


(7) Data source - Reconciliation option (since SDI/SDQ SPS12)

While modeling the flowgraph I also used the new reconciliation option that came with HANA SPS12.

This option enables you to reconcile the data sources used in any flowgraph and keep them up to date if the underlying table has changed (new column, removed column, changed field length, …).

I hope this blog entry helps you deal with the data mask node.

Let me know if something is misleading or if you don’t understand any part. I’m open to constructive criticism or fruitful discussions 😉

Best regards

Stefan