Product: SAP HANA 2.0 SPS04
Feature: Differential Privacy
This article is part of a series on tapping into the enormous value locked away in a company’s privacy-protected data. The first article demonstrates how to use SAP HANA to analyze privacy-protected attributes through “generalization.” This second article demonstrates the value and process of anonymizing sensitive measures.
Tap into the enormous value of your protected data
In a previous role, I had the pleasure of leading an extremely talented team of software engineers. As the team grew and positions opened up, I would work side by side with HR to establish the hiring salary range. The process was often challenging: salaries are always considered confidential, and the company’s reporting system could not anonymize them, so I was unable to get meaningful salary comparisons.
Truth be told, in most cases like mine, access to the actual private data is less important than knowing accurate statistics such as the mean, standard deviation, and sum.
With differential privacy, any company can benefit from access to both sensitive and non-sensitive data. Data modelers/analysts now have the ability to maximize the value held by all of their company’s data. Senior management teams can make better decisions from more robust and comprehensive data sets. Most importantly, data that leads to better insight can also lead to new revenue streams. If access to a protected dataset adds value internally, how much more value could other companies in the marketplace add to the same protected data?
Anonymize sensitive measures
Differential privacy is a method that anonymizes measures by adding or subtracting “noise.” Expressed as a simple formula: Original Measure +/- Noise = Anonymized Measure. The data modeler governs the amount of noise, so that the anonymized measure (1) closely resembles the original measure, (2) somewhat resembles it, or (3) does not closely resemble it.
A simple example might include three numbers: 100, 200, and 300. Together these numbers average to 200. After configuring the parameters to minimize the amount of noise, the output could contain 90, 205, and 298 with an average of 197.67.
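The three-number example above can be sketched in a few lines of Python. This is an illustration only, not SAP HANA's implementation: it assumes the noise is drawn from a Laplace distribution whose scale is sensitivity divided by epsilon (the textbook mechanism for differential privacy), and the function names `laplace_noise` and `anonymize` and the parameter values are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws follows a Laplace
    # distribution centered on 0 with the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def anonymize(values, epsilon, sensitivity, seed=None):
    """Return a noised copy of values: Original Measure +/- Noise."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon  # assumed noise scale (Laplace mechanism)
    return [v + laplace_noise(scale, rng) for v in values]

original = [100.0, 200.0, 300.0]
# Assumed settings: sensitivity = max - min = 200, and a high epsilon for little noise.
noised = anonymize(original, epsilon=10.0, sensitivity=200.0, seed=42)

print("original average:", sum(original) / len(original))
print("noised values:   ", [round(v, 2) for v in noised])
print("noised average:  ", round(sum(noised) / len(noised), 2))
```

Each run with a different seed produces different noised values; the individual numbers move, while the average tends to stay close to the original when the noise is small.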
See the tables below for another example using a simple, real-world use-case.
Implement with a Calculation View
1. Place an Anonymize Node into your view’s logic
2. Set the field mapping. Note that you need at least two fields: an “ID” field and a numeric field that will be the target of the “noise”. Feel free to map any additional fields needed in the output; the Anonymize node will ignore them.
3. Configure the parameters: Sequence Column. This parameter is required and refers to the “ID” field from step 2. The acceptable data types are integer and big integer.
4. Configure the parameters: Epsilon. This parameter is required and determines how far the final output may differ from the original value. A typical range for this parameter might be 0.01 to 10. A lower value such as 0.01 adds or subtracts the most noise. A higher value such as 5 or 10 produces results with very little noise.
5. Configure the parameters: Sensitivity. This parameter is required and sets the scale of the noise that Epsilon then tunes: for a given Epsilon, a larger Sensitivity allows more noise. A general rule of thumb (though not required) is to use the difference between the maximum and minimum values of the column to be anonymized.
6. Configure the parameters: Noised Column. This parameter is required and refers to the numeric field from step 2. The acceptable data types are double or float.
7. Finish the rest of the graph flow. Save and build.
8. Test different inputs for Epsilon and Sensitivity to determine the optimal amount of statistical noise needed for the use case.
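Before experimenting in the view itself, it can help to get a feel for how Epsilon and Sensitivity interact. The sketch below is a rough simulation rather than a reproduction of the Anonymize node: it assumes Laplace noise with a scale of sensitivity divided by epsilon, uses made-up salary figures, and applies the max-minus-min rule of thumb from step 5. All names in it are hypothetical.

```python
import random
import statistics

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws follows a Laplace
    # distribution centered on 0 with the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

# Made-up salaries, for illustration only.
salaries = [52000.0, 61000.0, 58500.0, 75000.0, 67000.0]
true_mean = statistics.fmean(salaries)

# Rule of thumb from step 5: maximum minus minimum of the column.
sensitivity = max(salaries) - min(salaries)

rng = random.Random(7)  # fixed seed so repeated runs are comparable
avg_error = {}
for epsilon in (0.01, 0.1, 1.0, 10.0):
    scale = sensitivity / epsilon  # assumed noise scale (Laplace mechanism)
    errors = []
    for _ in range(1000):
        noised = [s + laplace_noise(scale, rng) for s in salaries]
        errors.append(abs(statistics.fmean(noised) - true_mean))
    # Average distance between the noised mean and the true mean.
    avg_error[epsilon] = statistics.fmean(errors)
    print(f"epsilon={epsilon:<5} typical error of the average: {avg_error[epsilon]:,.2f}")
```

Under these assumptions the typical error of the average shrinks as Epsilon grows, which mirrors the guidance in step 4: low values hide the data well but distort statistics, while high values keep statistics accurate but offer weaker protection.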
Unlock more value from your data today
Differential Privacy can enable your organization to analyze protected measures without compromising the actual data. Are there any use cases that you can apply Differential Privacy to? Which business problems can now be solved with the ability to anonymize data like salaries, actual costs, or sensitive GL postings?
To start using SAP HANA today, sign up for SAP HANA in the Cloud or contact your account representative.
Privacy Protected Data Has Value Too! (Part 1 of 2): https://blogs.sap.com/2019/07/10/privacy-protected-data-has-value-too-part-1-of-2/
Andrea Kristen’s Blog Post: https://blogs.sap.com/2017/11/10/anonymization-analyze-sensitive-data-without-compromising-privacy/
SAP HANA Differential Privacy Documentation: https://help.sap.com/viewer/b3ee5778bc2e4a089d3299b82ec762a7/2.0.04/en-US/ace3f36bad754cc9bbfe2bf473fccf2f.html