Anonymization: Analyze sensitive data without compromising privacy
When is data truly anonymized? You can probably remember several cases where organizations such as public transport operators or telecommunication providers published insufficiently “anonymized” data sets, resulting in damaging and highly visible news headlines.
This is not to do any finger-pointing, because you know what? Anonymization is really hard! For many real-life use cases it isn’t enough to just substitute names with pseudonyms, or mask some of the values. With a little additional background knowledge it is often possible to identify the individuals you thought had been anonymized.
Organizations are increasingly looking for ways to reconcile modern data-centric business use cases with stringent privacy regulations like the General Data Protection Regulation (GDPR). So how can organizations make sure they do the right thing, and show that they are taking their digital responsibility seriously?
SAP wants to support customers on their digital transformation journey and let them turn the privacy challenge into an opportunity. Our vision is to provide real-time anonymized access to data and, by doing so, make data available for use cases previously prevented by data protection and privacy regulations.
Read more about how to turn the data privacy challenge into business value in this blog by Daniel Schneiss.
Data Anonymization – available now
The SAP HANA team has been putting a lot of thought and research into how to best help customers to safeguard data privacy, while unlocking the full potential of their data in modern analytic use cases. With SAP HANA 2.0 SPS 03, we have released a customizable functionality that allows organizations to anonymize live data – by providing an anonymized view of their data in SAP HANA. For more information on the new security features in this release, check out this blog.
Let me briefly explain what differential privacy and k-anonymity are about. These methods come into play after obvious protection measures for direct identifiers have been applied, like pseudonymizing real names or masking social security numbers.
Differential privacy adds random noise to your data, for example to salary amounts in an employee survey. Looking at individual records, you won’t get any meaningful results, and thus the privacy of individuals is protected. However, the noise is added in such a statistically clever way that you can still gain valid numerical insights when doing analytics on the whole data set.
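The effect described above can be illustrated with a minimal Python sketch. It assumes Laplace-distributed noise, a common choice for differential privacy, and is not SAP HANA's implementation. The salary range and the `sensitivity` and `epsilon` values are hypothetical and chosen only for illustration.

```python
import random
import statistics

def laplace_noise(scale, rng):
    # The difference of two independent exponential variables with mean
    # `scale` follows a Laplace distribution centered on 0.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def add_noise(values, sensitivity, epsilon, rng):
    # Assumption: the noise scale is sensitivity / epsilon, as in the
    # standard Laplace mechanism. A smaller epsilon means more noise.
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]

rng = random.Random(42)

# Hypothetical survey: 10,000 salaries between 30,000 and 90,000.
salaries = [rng.uniform(30_000, 90_000) for _ in range(10_000)]
noisy = add_noise(salaries, sensitivity=60_000, epsilon=1.0, rng=rng)

true_mean = statistics.mean(salaries)
noisy_mean = statistics.mean(noisy)
# Average distortion of a single record.
record_error = statistics.mean(abs(a - b) for a, b in zip(salaries, noisy))

print(f"first record: true {salaries[0]:.0f}, noisy {noisy[0]:.0f}")
print(f"average error per record: {record_error:.0f}")
print(f"mean salary: true {true_mean:.0f}, noisy {noisy_mean:.0f}")
```

In this sketch a single noisy salary is typically off by tens of thousands, while the mean over all records stays close to the true mean, because the zero-centered noise largely cancels out across many records.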
k-anonymity hides individuals in groups by generalizing some of the values in the data set. Looking at census data, this could mean not listing actual birth dates, but working only with year or decade ranges. For ZIP codes, it could mean generalizing according to hierarchies such as city or county. The number “k” specifies the minimum number of members in each of these groups in a data set.
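As a rough illustration of the grouping idea, the following Python sketch generalizes birth years to decades and ZIP codes to a two-digit prefix, then checks the size of the smallest group. The records, the prefix-based ZIP hierarchy, and all function names are assumptions made for this example; it does not show how SAP HANA implements k-anonymity.

```python
from collections import Counter

# Hypothetical census-style records. Birth year and ZIP code act as
# quasi-identifiers; income is the value to be analyzed.
people = [
    {"birth_year": 1981, "zip": "69115", "income": 52000},
    {"birth_year": 1984, "zip": "69117", "income": 61000},
    {"birth_year": 1989, "zip": "69120", "income": 48000},
    {"birth_year": 1972, "zip": "68159", "income": 75000},
    {"birth_year": 1975, "zip": "68161", "income": 58000},
    {"birth_year": 1979, "zip": "68165", "income": 67000},
]

def raw_key(person):
    # No generalization: exact birth year and full ZIP code.
    return (person["birth_year"], person["zip"])

def generalized_key(person):
    # Assumed hierarchy: birth year -> decade, ZIP code -> 2-digit prefix.
    decade = person["birth_year"] // 10 * 10
    return (f"{decade}s", person["zip"][:2] + "***")

def smallest_group(rows, key):
    # Size of the smallest group of rows sharing the same key.
    return min(Counter(key(row) for row in rows).values())

def is_k_anonymous(rows, key, k):
    # k-anonymity holds if every group has at least k members.
    return smallest_group(rows, key) >= k

print("smallest group, raw data:", smallest_group(people, raw_key))
print("smallest group, generalized:", smallest_group(people, generalized_key))
print("3-anonymous after generalization:",
      is_k_anonymous(people, generalized_key, 3))
```

With the raw values every person forms a group of their own, whereas after generalization each person shares a decade and ZIP prefix with at least two others.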
This is just a brief introduction to the anonymization methods. For more information, read the blog Going beyond masking: how to anonymize large data sets, watch this video or read the documentation.
What can you do with anonymization that wasn’t possible before?
The examples above already hint at some potential use cases, but there are many more, for example:
- Data as a service, where cloud providers could give access to anonymized user profile data for advertising purposes, or telecommunication providers give access to anonymized location data for city planning purposes.
- Telemetry and IoT, where car fleet managers could share anonymized car usage patterns with manufacturers, or energy suppliers could provide smart meter analytics based on anonymized usage data.
- Healthcare, where hospitals could make anonymized patient data available for researchers and insurers.
- Archiving, where insurers could store anonymized historical data to be able to keep it even after the legal deletion periods.
In the use cases above, anonymization is primarily applied to protect the privacy of individuals. But there is a whole other dimension of use cases that anonymization makes possible as well: analytics on business-confidential data. Businesses within a similar sector or peer group could benchmark their performance against each other without revealing detailed financial or operational data.
You can probably think of some typical examples in your business area as well! To learn more about data anonymization, go to http://www.sap.com/data-anonymization
And just one closing remark: managing secure data access and configuring systems securely continue to be critical operational tasks – none of that goes away. Anonymization is a new tool in the toolbox, aimed primarily at doing analytics on whole data sets to which access was previously denied. It complements other security mechanisms such as masking, authorization, and encryption. SAP HANA has security built into its core, with a comprehensive framework and tooling for authentication and single sign-on, authorization and role management, user and identity management, audit logging, secure configuration and encryption. Find out more about SAP HANA security at http://www.sap.com/hanasecurity
The link you provide here ( https://wp.me/p5oBjm-4fwq which then resolves to https://blogs.saphana.com/?p=1012982 ) does not seem to work: I get a "Page Not Found!" there:
Hi Joachim, this was a timing issue, the link should work now.
Thank you for being an early tester!
True, it works now, thanks Volker!
Plus, I learned that there is not only blogs.sap.com (I knew that) but also blogs.saphana.com ?!
Thanks for letting us know. The link should now work again.
Andrea Kristen, is the k-anonymity hierarchy only available in the Web IDE, or is it also available in HANA studio? Does k-anonymity work with Lumira?
Modelling calculation views (where the anonymization feature is integrated) is deprecated in HANA studio, so it is only available in the Web IDE.
k-anonymity works with Lumira if it consumes the anonymized calculation view.
Hope this helps 🙂
One of my colleagues is looking for that. How does it work? I have looked at the TechEd material, your blogs, and the HANA security guide, and found nothing about that use case.
Do I have to define a view on the data (with anonymization), then write this data to the database, and finally delete the original data? Or does it work differently?
This is exactly how it works. (Disclaimer: this is not legal advice!) In general, deletion periods apply to personal data. Once the data is anonymized, or in other words no longer personal, those rules do not apply. The HANA approach itself works in real time, which means the original data stays in the system. However, you can export the anonymized data and then delete the original. This is actually quite simple with SQL: "create table AnonT as select * from T" and the anonymized result is persisted.
Did that help? Feel free to reach out to me or Andrea directly!
That helped me a little. After a phone call, my colleague told me this:
We have data like a user ID which is stored in SAP standard tables in ERP or, later for us, in S/4.
After 10 years we have to delete this data.
We do not want to copy the content to a new table (as you suggested); we want to anonymize the original table (so that all the SAP logic still works).
Is that possible?
Thanks in advance
We cannot do this directly on the table, but what you could do is persist the anonymized result as described in my previous post, delete the original data, and copy the anonymized data back.
This is not a very nice approach, but it can be handled within a single transaction, which makes it safe.
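The persist, delete and copy-back sequence described above could look roughly like the following Python sketch. It uses an in-memory SQLite database as a stand-in, not SAP HANA. The table `T`, the view `ANON_V`, and the function name are hypothetical, and the view is only a placeholder for a real anonymized view. Here the staging table is created first and only the delete and copy-back steps run inside the transaction, because whether table creation can take part in a transaction differs between databases and is not verified here for SAP HANA.

```python
import sqlite3

# Autocommit mode, so that transactions are controlled explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE T (user_id TEXT, birth_year INTEGER, amount REAL);
    INSERT INTO T VALUES ('u1', 1981, 10.0), ('u2', 1984, 20.0), ('u3', 1989, 30.0);
    -- Placeholder for an anonymized view. A real one would apply
    -- k-anonymity or differential privacy instead of this simple rewrite.
    CREATE VIEW ANON_V AS
        SELECT 'REMOVED' AS user_id,
               birth_year / 10 * 10 AS birth_year,
               amount
        FROM T;
""")

def replace_with_anonymized(conn):
    # Step 1: persist the anonymized result in a staging table.
    conn.execute("CREATE TABLE AnonT AS SELECT * FROM ANON_V")
    # Steps 2 and 3: delete the original rows and copy the anonymized rows
    # back. Both happen in one transaction, so a failure leaves T unchanged.
    conn.execute("BEGIN")
    try:
        conn.execute("DELETE FROM T")
        conn.execute("INSERT INTO T SELECT * FROM AnonT")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    # Step 4: remove the staging table once the copy has succeeded.
    conn.execute("DROP TABLE AnonT")

replace_with_anonymized(conn)
rows = conn.execute(
    "SELECT user_id, birth_year, amount FROM T ORDER BY amount"
).fetchall()
print(rows)
```

The original table keeps its name and column layout, so code that reads from it continues to find it, although whether the application logic still behaves correctly with the anonymized values has to be checked separately.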
Does that help?
if the business logic is still intact, then yes.