Anonymization: Analyze sensitive data without compromising privacy
When is data truly anonymized? You can probably remember several cases where organizations such as public transport operators or telecommunication providers published insufficiently “anonymized” data sets, resulting in damaging and highly visible news headlines.
This is not to do any finger-pointing, because you know what? Anonymization is really hard! For many real-life use cases it isn’t enough to just substitute names with pseudonyms, or mask some of the values. With a little additional background knowledge it is often possible to identify the individuals you thought had been anonymized.
Organizations are increasingly looking for ways to reconcile modern data-centric business use cases with stringent privacy regulations like the General Data Protection Regulation (GDPR). So how can organizations make sure they do the right thing, and show that they are taking their digital responsibility seriously?
SAP wants to support customers on their digital transformation journey and let them turn the privacy challenge into an opportunity. Our vision is to provide real-time anonymized access to data and, by doing so, make data available for use cases previously prevented by data protection and privacy regulations.
Read more about how to turn the data privacy challenge into business value in this blog by Daniel Schneiss.
Data Anonymization – available now
The SAP HANA team has been putting a lot of thought and research into how best to help customers safeguard data privacy while unlocking the full potential of their data in modern analytic use cases. With SAP HANA 2.0 SPS 03, we have released customizable functionality that allows organizations to anonymize live data – by providing an anonymized view of their data in SAP HANA. For more information on the new security features in this release, check out this blog.
Two anonymization methods are supported: differential privacy and k-anonymity. Let me briefly explain what they are about. These methods come into play after obvious protection measures for direct identifiers have been applied, like pseudonymizing real names or masking social security numbers.
Differential privacy adds random noise to your data, for example to salary amounts in an employee survey. Looking at individual records, you won’t get any meaningful results, and thus the privacy of individuals is protected. However, the noise is added in such a statistically clever way that you can still gain valid numerical insights when doing analytics on the whole data set.
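To make the idea more tangible, here is a minimal Python sketch of the Laplace mechanism, one common way to add this kind of noise. It is an illustration only and not how SAP HANA implements the feature: the function names, the salary range of 30,000 to 90,000, and the values chosen for `sensitivity` and `epsilon` are all assumptions made for this example.

```python
import random
import statistics

def laplace_noise(scale, rng):
    # The difference of two independent exponential samples with the same
    # mean follows a Laplace distribution centered at zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def add_noise(values, sensitivity, epsilon, rng):
    # Smaller epsilon means more noise and therefore stronger privacy.
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical survey: 10,000 salaries between 30,000 and 90,000.
salaries = [rng.uniform(30_000, 90_000) for _ in range(10_000)]

# Sensitivity is assumed to be the width of the salary range.
noisy = add_noise(salaries, sensitivity=60_000, epsilon=1.0, rng=rng)

true_mean = statistics.mean(salaries)
noisy_mean = statistics.mean(noisy)

print("first record, original:", round(salaries[0]))
print("first record, noisy:   ", round(noisy[0]))
print("mean, original:", round(true_mean))
print("mean, noisy:   ", round(noisy_mean))
```

A single noisy record tells you very little about the person behind it, while the average over the whole data set stays close to the true average because the noise largely cancels out across many records.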
k-anonymity hides individuals in groups by generalizing some of the values in the data set. Looking at census data, this could mean not listing actual birth dates, but working only with year or decade ranges. Or looking at ZIP codes, this could mean generalizing according to hierarchies such as city or county. The number “k” specifies the minimum number of members in each of these groups in a data set.
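The grouping idea can be sketched in a few lines of Python. This is a simplified illustration and not the SAP HANA implementation: the sample records, the `generalize` and `is_k_anonymous` helpers, and the choice to generalize ZIP codes by keeping only their first three digits (instead of a real city or county hierarchy) are assumptions made for this example.

```python
from collections import Counter

def generalize(record):
    # Replace the birth year with its decade and keep only a ZIP prefix.
    decade = record["birth_year"] // 10 * 10
    return (f"{decade}s", record["zip"][:3] + "**")

def is_k_anonymous(records, k):
    # Every combination of generalized values must occur at least k times.
    groups = Counter(generalize(r) for r in records)
    return bool(groups) and min(groups.values()) >= k

# Hypothetical census-style records.
records = [
    {"birth_year": 1981, "zip": "69115"},
    {"birth_year": 1984, "zip": "69117"},
    {"birth_year": 1989, "zip": "69120"},
    {"birth_year": 1992, "zip": "10115"},
    {"birth_year": 1995, "zip": "10117"},
]

for group, size in Counter(generalize(r) for r in records).items():
    print(group, size)

print("k=2:", is_k_anonymous(records, 2))
print("k=3:", is_k_anonymous(records, 3))
```

In the raw data every record is unique. After generalization there are two groups, one with three members and one with two, so this small data set satisfies k=2 but not k=3. To reach a higher k you would generalize further, for example to broader age ranges or regions, at the cost of less detailed analytics.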
This is just a brief introduction into the anonymization methods. For more information, read the blog Going beyond masking: how to anonymize large data sets, watch this video or read the documentation.
What can you do with anonymization that wasn’t possible before?
The examples above already hint at some potential use cases, but there are many more, for example:
- Data as a service, where cloud providers could give access to anonymized user profile data for advertising purposes, or telecommunication providers give access to anonymized location data for city planning purposes.
- Telemetry and IoT, where car fleet managers could share anonymized car usage patterns with manufacturers, or energy suppliers could provide smart meter analytics based on anonymized usage data.
- Healthcare, where hospitals could make anonymized patient data available to researchers and insurers.
- Archiving, where insurers could store anonymized historical data so that they can keep it even after the legal deletion periods.
In the use cases above, anonymization is primarily applied to protect the privacy of individuals. But there is a whole other dimension of use cases that anonymization makes possible as well: analytics on business-confidential data. Businesses within a similar sector or peer group could benchmark their performance against each other without revealing detailed financial or operational data.
You can probably think of some typical examples in your business area as well! To learn more about data anonymization, go to http://www.sap.com/data-anonymization
And just one closing remark: managing secure data access and configuring systems securely continue to be critical operational tasks – none of that goes away. Anonymization is a new tool in the toolbox, aimed primarily at enabling analytics on whole data sets to which access was previously denied. It complements other security mechanisms such as masking, authorization, and encryption. SAP HANA has security built into its core, with a comprehensive framework and tooling for authentication and single sign-on, authorization and role management, user and identity management, audit logging, secure configuration, and encryption. Find out more about SAP HANA security at http://www.sap.com/hanasecurity