Mask Functionality in DS 4.2 SP 04
Intro
In my post SAP Data Services 4.2 SP4 New Features I highlighted some of the new functionality introduced in Data Services 4.2 SP04. One of the new features was the ability for Data Services to mask data. This blog focuses on how you can mask data.
The Data Mask transform enables you to protect personally identifiable information in your data. Personal information includes data such as credit card numbers, salary information, birth dates, personal identification numbers, or bank account numbers. You may want to use data masking to support security and privacy policies, and to protect your customer or employee information from possible theft or exploitation.
Data Mask Example
Here is an example of a very basic data flow with the Data Mask transform being used.
On the Input tab we then indicate the fields to mask.
Then fill in the Options tab. For every field you want to mask you must duplicate the mask section, and then assign a field to each mask-out section.
Then on the Output tab choose the fields you want to output.
Here is a sample of the masking, using the View Design-Time Data button.
Conclusion
The masking option is a nice addition and very welcome. It does, however, feel a bit incomplete. In many of the places where we use masking here in South Africa we mask credit card numbers, but we only mask, for example, the middle part of the card number. Currently this mask only lets you specify where to start, and it then masks all of the remaining characters. I would therefore like to see an option to stipulate both where to start and how many characters to mask.
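To illustrate the difference in plain Python (this is not Data Services functionality, just a sketch of the behaviour described above): the first function masks everything from a start position onward, while the second masks only a fixed number of characters, so the middle of a card number can be hidden while the last digits stay visible.

```python
def mask_from(value: str, start: int, mask_char: str = "X") -> str:
    """Mask every character from `start` to the end (the current behaviour)."""
    return value[:start] + mask_char * len(value[start:])

def mask_range(value: str, start: int, length: int, mask_char: str = "X") -> str:
    """Mask `length` characters beginning at `start` (the requested behaviour)."""
    end = start + length
    return value[:start] + mask_char * len(value[start:end]) + value[end:]

card = "5412751234567890"
print(mask_from(card, 6))      # 541275XXXXXXXXXX
print(mask_range(card, 6, 6))  # 541275XXXXXX7890 -> middle masked, last 4 visible
```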
I also noticed that when the field names contain spaces you get a few errors saying the mapping is not complete. Once I removed the spaces from the field names I was error free.
Hope the above helps.
Thanks.
Hi Louis,
First thank you for the great write up on our new Data Mask transform.
Your wish has been granted regarding the ability to mask out the middle portion of credit card numbers. In the Data Services 4.2 SP5 release we have enhanced the Data Mask transform to allow users to mask data that follows a specific pattern, such as credit card numbers, personal identification numbers, bank account numbers and so on. Users will have the ability to mask out the entire pattern or specific portions of the pattern.
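For readers who want to picture the idea, here is a rough Python sketch of pattern-based masking (an illustration of the concept only, not the SP5 implementation or its option names): the value is matched against a pattern and only the chosen portion is masked.

```python
import re

def mask_card_pattern(card: str) -> str:
    # Describe the value as a pattern of three groups: issuer prefix, middle, tail.
    match = re.fullmatch(r"(\d{6})(\d{6})(\d{4})", card)
    if not match:
        return card  # value does not follow the expected pattern, leave it alone
    issuer, middle, tail = match.groups()
    # Mask only the chosen portion of the pattern (the middle six digits here).
    return issuer + "X" * len(middle) + tail

print(mask_card_pattern("5412751234567890"))  # 541275XXXXXX7890
```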
New Pattern Variance Options, available in Data Services 4.2 SP5:
As you mention, the Data Mask transform is a new transform that was delivered in Data Services 4.2 SP3 and we continue to enhance the functionality. Please feel free to contact me, via this medium, if you have additional enhancements you would like to see.
Thank you,
Lynne
Hi Lynne
That is great news.
Thanks for the update.
Louis
Hi Louis,
Would you mind sending me your email address?
Thank you,
Lynne
Just popped you an e-mail. 😉
Hi Louis,
Thanks for presenting a nice article.
My concern here is that DS already has built-in functions called encrypt_aes() and decrypt_aes() to mask or safeguard data within the DS staging level.
How is this new transform different?
Hi Kolli,
This transform is different in lots of ways. The Data Mask transform allows you to mask sensitive data while keeping the data relevant, so it can still be used by other systems, and it allows you to maintain referential integrity. I've added some examples below of what the Data Mask transform can do for you:
Preserving relationships:
In a normalized relational database we quite frequently rely on the data itself to be a key of the table; for example, a social security number is unique and could be the primary key of that table. In this case references to the SS# in other tables (foreign keys) should match the values of the primary key, otherwise it will be impossible to relate the tables. For example, one table may contain SS# and names, another may contain SS# and purchase history, and the tables need to be joined on SS#.
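As a rough illustration of this idea (plain Python, not the Data Mask transform's internal algorithm), a deterministic, keyed mask produces the same output for the same input, so tables masked with the same key still join on SS#. The key name and the 9-digit folding below are assumptions made just for the sketch.

```python
import hmac, hashlib

SECRET_KEY = b"masking-demo-key"  # hypothetical key, kept secret in practice

def mask_ssn(ssn: str) -> str:
    digest = hmac.new(SECRET_KEY, ssn.encode(), hashlib.sha256).hexdigest()
    # Fold the digest into 9 digits so the masked value still looks like an SSN.
    return f"{int(digest, 16) % 10**9:09d}"

names     = {"123-45-6789": "John Smith"}
purchases = {"123-45-6789": ["order-1", "order-2"]}

# Both tables mask the key the same way, so the join key stays consistent.
masked_names     = {mask_ssn(k): v for k, v in names.items()}
masked_purchases = {mask_ssn(k): v for k, v in purchases.items()}

for ssn, name in masked_names.items():
    print(ssn, name, masked_purchases[ssn])  # the join still works
```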
Preserving the shape of the data set:
Blanking out sensitive data also makes it hard to do analysis on it downstream. For example, changing all birthdates to 1/1/1900, while it hides the information, makes it hard to do demographic analysis. In this particular example it may be better to scramble the day and month while retaining the year. Similarly, though we may de-identify zip codes, we want the number of rows mapping to the obfuscated zip code to remain the same, so we know how the data is distributed without knowing the actual data.
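A minimal Python sketch of this idea (an illustration only, not the transform's variance logic), assuming a fixed day range of 1-28 so every generated date is valid:

```python
import random
from datetime import date

def scramble_birthdate(d: date, rng: random.Random) -> date:
    # Scramble day and month, keep the year for demographic analysis.
    month = rng.randint(1, 12)
    day = rng.randint(1, 28)  # 28 keeps the date valid for every month
    return date(d.year, month, day)

rng = random.Random(42)
print(scramble_birthdate(date(1981, 7, 23), rng))  # the year 1981 is retained
```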
Keeping the data sensible:
People consuming the data downstream will find it hard if they are dealing with nonsensical data. For example, if we convert a name like ‘John W. Duncan’ to ‘@#R%amkGG87%%’ it is jarring visually and also creates problems for other programs operating on this data downstream (for example, programs asserting that a name should consist only of letters). It would be better to convert ‘John W. Duncan’ to ‘Scott T. Smith’ instead. The same holds for address information: ‘TX’ should be scrambled to something like ‘IL’, not ‘ZZ’. Care should be taken that we don’t inadvertently provide wrong data; it would not do if we change a social security number and the new social security number identifies a different but valid person! Another point to keep in mind is the range of the de-identified value. For example, if we want to scramble the salary information of employees, the new value should fall within a reasonable range; this could be an absolute range (between 10K and 1000K) or a percentage (+/- 50% of the original value).
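A small sketch of the same idea in plain Python, assuming a hypothetical list of replacement names and a +/- 50% salary range; it only illustrates the principle, not how the transform generates its substitute values.

```python
import random

REPLACEMENT_NAMES = ["Scott T. Smith", "Jane R. Doe", "Alan B. Jones"]  # hypothetical lookup list

def mask_name(name: str, rng: random.Random) -> str:
    # Replace the real name with another realistic-looking name.
    return rng.choice(REPLACEMENT_NAMES)

def mask_salary(salary: float, rng: random.Random) -> float:
    # Keep the masked salary within +/- 50% of the original value.
    return round(salary * rng.uniform(0.5, 1.5), 2)

rng = random.Random(7)
print(mask_name("John W. Duncan", rng))
print(mask_salary(100_000, rng))  # somewhere between 50,000 and 150,000
```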
Preserving part of the data:
Quite often only part of the data is sensitive. For example, in a credit card number the first 6 digits identify the country and issuing bank and can be left open (and are useful information), but the rest of the digits should be hidden. Similarly, in an e-mail address we may want to hide just the username part while leaving the domain open.
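For illustration, a minimal Python sketch of masking only the sensitive part of a value, here the e-mail username (a conceptual example, not the transform's option names):

```python
def mask_email_user(email: str, mask_char: str = "x") -> str:
    user, sep, domain = email.partition("@")
    if not sep:
        return email  # not an e-mail address, leave unchanged
    # Hide the username, keep the domain open.
    return mask_char * len(user) + "@" + domain

print(mask_email_user("lynne@sap.com"))  # xxxxx@sap.com
```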
Preserving the format:
A simpler requirement (than the sensible-data need) is to at least maintain the format of the data. For example, an e-mail address should map from lynne@sap.com to asdfa@xx_email.com rather than ‘elrjnmkmjnfer##’; the same applies to social security numbers.
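As a plain-Python illustration of format preservation (not the transform itself): letters map to letters, digits to digits, and separators such as ‘@’, ‘.’ and ‘-’ are kept, so the masked value still looks like an e-mail address or an SSN.

```python
import random
import string

def scramble_keep_format(value: str, rng: random.Random) -> str:
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators, so the format stays recognisable
    return "".join(out)

rng = random.Random(1)
print(scramble_keep_format("lynne@sap.com", rng))  # e.g. qwzrt@bnm.xyv
print(scramble_keep_format("123-45-6789", rng))    # e.g. 804-12-3391
```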
Sufficient de-identification:
It is important that the information cannot be reconstituted by linking the scrambled information with other sources that are not scrambled. For example, if in the salary table we obfuscate the names of people but leave their city of residence open, and there is another table where the employees and their addresses are listed, we can infer information in the scrambled data set by cross-referencing. Say we have: Name: John Smith, Sal: $9,000,000, City: Scotts Valley. In another table we have Name: John Schwarz, City: Scotts Valley, and there is only 1 employee in Scotts Valley, so we know how much John makes. It is therefore important that combinations of multiple fields be considered when creating the anonymization. This also matters when there is a limited number of source values; for example, if you want to hide the gender of a person there is no point mapping M to X and F to Y, as it is easily deducible.
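One way to picture this risk, as a rough Python sketch with made-up rows: count how many rows share each combination of fields left in the clear; any group of size one can be re-identified by cross-referencing another source.

```python
from collections import Counter

# Hypothetical rows after masking the names but leaving the city open.
rows = [
    {"name": "XXXX", "salary": 9_000_000, "city": "Scotts Valley"},
    {"name": "XXXX", "salary": 85_000, "city": "Palo Alto"},
    {"name": "XXXX", "salary": 90_000, "city": "Palo Alto"},
]

# Group sizes for the field left in the clear; a group of 1 is re-identifiable.
group_sizes = Counter(row["city"] for row in rows)
risky = [city for city, size in group_sizes.items() if size < 2]
print(risky)  # ['Scotts Valley'] -> only one employee, easily re-identified
```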
I hope this helps.
Lynne
Thanks Lynne, for your detailed explanation. 🙂
Great detailed response. Thanks Lynne
Hi Lynne,
What is the ETA for the release of Data Services 4.2 SP5?
Hi Pravin,
Data Services 4.2 SP5 has been released and is available for download.
Happy masking.
Lynne
Hi Louis,
How can I use the above transform to mask more than 6 numeric fields, as the above transform has only 6 numeric fields?
Is there any ability to add new numeric fields to the same transform, or do I need to use more than one transform to do the masking in multiple levels?
Hi Salam,
An option would be to have two Data Mask transforms in your dataflow. The first one could mask the first 6 numeric fields and the second transform could mask numeric fields 7-12.
I hope that helps answer your question.
Thanks,
Lynne
Thank you Lynne,
It seems this is the only solution; I already use it. I hope SAP enhances this transform so that we can add new fields.
Thanks
Salam
Hi
Is there any way to unmask the data from the masked data, without referring to the source data?
Thanks
Ram