Greater Than the Sum of the Parts
Data is key to understanding every aspect of an enterprise – from customer and employee behavior to market trends. But while the value of data is indisputable, it is also not always easy to unlock. Data is rightfully protected and respecting this protection is both a legal and moral imperative for all organizations.
Utilizing (personal) data for analytics and machine learning has the potential to improve our lives, our environment, and our health. It can help forecast energy demands, resulting in a better use of renewable energy. It can help improve the way we manage traffic to avoid congestion and better plan our cities. It can help us uncover cures to fight diseases such as cancer. So the question remains – how can we unlock the potential insights of the data without posing any risk to the privacy of the individuals to whom it belongs?
This is a question I have been working on since I first started my Ph.D. in data protection almost a decade ago. Then, after I’d been working at SAP for about a year in the Big Data team, I saw a customer presentation about the potential of their marketing ambitions. The customer explained that the company was limited in what it could do out of respect for the privacy of the data.
I recognized that this was a huge opportunity with a wide variety of potential use cases and so I began to develop a way to productize anonymization methods – essentially turning my research into a product.
At the end of 2016, I got the chance to make my dream come true and began work on the SAP HANA Data Anonymization functionality. Data anonymization methods allow enterprises to use the data for applications and analysis while still ensuring everyone’s privacy is protected. To do this, it’s not enough to simply remove names or other kinds of identifiers such as social security numbers to render a dataset anonymous.
As an example, imagine a classroom in which the teacher asks the pupil in the red shirt to leave the room. Assuming that there is only one person with a red shirt in this room, everybody will know who needs to leave the room without the teacher having to identify the student by name. Simply by virtue of the fact that there is only one person with a red shirt in the room, it’s possible to work out exactly to whom the teacher is referring. The situation completely changes if there are many people in the room wearing red shirts. In this case, no one would know who the teacher meant: the specific individual is hidden in a crowd.
This is the same fundamental principle that we apply in one of the data anonymization methods in SAP HANA. We make sure that there are at least “k” individuals with the same properties (such as the red shirt) in the anonymous data set. This method is called k-anonymity, and it is one of several anonymization methods from research that are implemented in SAP HANA to provide different privacy and utility guarantees. Using well-researched methods and being transparent about how anonymization works is key to building trust while dealing with very sensitive data. This is one of the reasons we published our work at the prestigious VLDB conference. Ultimately, this also allows us to create new applications that would have been unthinkable before.
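To make the red-shirt idea concrete, here is a minimal sketch in Python of what it means for a data set to be k-anonymous. It is only an illustration of the general principle and not how SAP HANA implements it; the function names, the field names ("shirt", "grade"), and the sample records are all made up for this example.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    return all(size >= k for size in group_sizes(records, quasi_identifiers).values())

def smallest_group(records, quasi_identifiers):
    """Size of the smallest crowd an individual can hide in (0 for an empty data set)."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0

# Hypothetical classroom: only one pupil wears a red shirt.
classroom = [
    {"shirt": "red", "grade": "A"},
    {"shirt": "blue", "grade": "B"},
    {"shirt": "blue", "grade": "C"},
]

# The same classroom after a second pupil in a red shirt joins.
crowd = classroom + [{"shirt": "red", "grade": "B"}]

print(is_k_anonymous(classroom, ["shirt"], 2))  # False: the red shirt is unique
print(is_k_anonymous(crowd, ["shirt"], 2))      # True: every shirt color appears twice
```

Note that the check depends on which properties are treated as identifying. If both "shirt" and "grade" are considered together, every pupil in this small example is unique again, which is why real anonymization methods typically generalize or suppress values until each group is large enough.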
Today, this technology is used by a wide range of organizations, helping them to derive invaluable insights from sensitive information such as healthcare data without revealing anything about the people behind it.
I now work with three other colleagues on this topic. In addition to building the software, a large part of my job is also about raising awareness around what it can do, how it can help customers, and introducing others to the technology and to our software.
One of my personal highlights was demoing the software at an employee meeting in front of thousands of colleagues. We had just a few minutes to explain this very technical topic, and it was a great exercise in learning how to really focus on the core message of the software. Yes, there was definitely an element of stage fright, but it was also great fun.
But beyond the presentations, one of the main highlights for me is actually the way the colleagues working on SAP HANA collaborate. SAP HANA obviously provides the in-memory speed and performance, but it also goes beyond core database management with application development, multi-model processing, and data integration and quality capabilities.
What makes SAP HANA Data Anonymization so unique is the fact that we are part of this broader set of capabilities which all work seamlessly together. No one else on the market offers the same kind of integrated data anonymization. From an architectural point of view, we are not offering anonymization alone, but anonymization integrated into a greater security framework and with processing engines such as spatial.
For example, SAP HANA manages the original personal and sensitive data, as well as the anonymized view of such data. The security framework has to make sure that users only get access to the data that they are allowed to see. The access needs to be auditable as well, so anonymization always works in the context of the greater security framework.
A second advantage of having all these capabilities in one integrated product is the broad knowledge within the team itself. The SAP HANA team consists of experts across a huge range of topics. It’s a large but active community of developers and product managers who are always open to technical discussions.
Just like these features and functions all come together seamlessly in the product, we all come together as a team. I call it “sharpening the features” – if someone has an idea, they can talk to the colleagues from other areas of SAP HANA. This combined expertise and the different perspectives mean that any ideas we might have for a specific area are ultimately refined and further improved. So it’s not just the anonymization nerds working on their own; the whole team is working out how this idea fits in and complements other elements of SAP HANA. The end result is a better feature and product for the customer.
Going forward, we’re looking at additional use cases for this technology. The potential is basically limitless, and we’re excited to work with customers on new proofs of concept. If you have a case in mind, let’s get in touch!
For more information on SAP HANA data anonymization visit https://www.sap.com/data-anonymization
Thanks for the update on the work on the anonymization features in SAP HANA.
So far the feature set has not been covered all that much in the usual how-to-use-SAP-tech channels. This leads to the impression that it is not actually used all that much.
Is this impression correct?
I miss actual references that tell about what they implemented with the features and how it works for them now.
There are the occasional "this could be good for users in the healthcare industry"-messages/posts, but I never see any actual implementation reports. In fact, the discussions I recall around this feature with healthcare providers and research organizations did not lead to a straight implementation (for various reasons beyond the feature set of HANA).
With the recent HANA 2 SPS 05, the anon. features are also available with plain SQL views (not just with graphical calc. views) which definitely allows for using them in more development scenarios.
As CDS is still the general direction SAP is pushing for in data model declaration, is there support for this on the horizon?
Going further, I'd be interested to see actual data-sharing scenarios and how they get implemented with the HANA features.
Things like "sharing data of the same cohort multiple times (e.g. data update) - will the result records still point to the same individuals?"
"can I ensure stable anonymization per data recipient?" (e.g. I share data with 3 consumers and for each of them the data about individual XZY should always resolve to the same records, but to different records for each of the consumers)
With such information it would probably be easier for many potential users to envision how those features could work in their environment.
thanks again & cheers,
Thank you so much for your feedback!
Topics involving the use of personal and sensitive data are always delicate to speak about in public. We have a list of customers using the SAP HANA data anonymization feature in the following industries: automotive, banking, insurance, healthcare, life sciences and public sector. Of course, the feature can also be used in other industry scenarios. We are actively working on more customer references that can be made available publicly.
SAP HANA data anonymization is available via the HDI interface in SAP HANA Cloud. Those HDI artefacts serve as transportable entities for anonymized SAP HANA views.
With respect to multiple data releases, we have several measures in place in SAP HANA that safeguard such releases of data sets that change over time. I would be more than happy to discuss further use cases and implementation scenarios. I will reach out to you!
Thanks again for your valuable feedback. It is much appreciated.
thanks for taking the time to respond here.
I think that publishing customer references and usage scenarios would help a great deal with mentally placing this technology in any organization's technology stack.
In my experience so far, the majority of customers are more interested in how a certain scenario can work and less in all the building blocks of features.