SAP Data Intelligence: Data Synthesizer for Machine Learning Operator
Data privacy has become a prominent topic worldwide and is receiving increasing attention from organizations, companies, and individuals. More and more regional regulations focused on data privacy protection are being released, such as GDPR, CCSL, and CCPA. The data required for the growing number of machine learning optimization and personalization use cases is often personal data, as defined by the various data protection regulations. Helping users comply with privacy regulations when they use data is therefore an important topic. The Data Synthesizer for Machine Learning operator aims to provide an effective way to protect personal data in order to comply with privacy regulations.
The Data Synthesizer for Machine Learning operator is released as a free content package for Data Intelligence. It generates data that hides private information while retaining data utility for later machine learning.
Table of Contents
- How to use it?
The operator handles structured datasets: it takes CSV data as input and outputs a synthesized dataset.
The mechanism is based on the published paper PrivBayes, which combines a Bayesian network with differential privacy (DP). DP has been one of the most widely used privacy models in recent years and provides a strong guarantee of data privacy. Combined with a Bayesian network, it can also preserve the attributes of the data and the correlations between them.
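To give a feel for what a privacy budget does, here is a minimal sketch of the Laplace mechanism, a standard differential privacy building block, applied to the value counts of a single column. It is not the operator's code: the function names and the toy column are made up for illustration, and it assumes that neighbouring datasets differ by adding or removing one record, so each count has sensitivity 1.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_distribution(values, domain, epsilon, rng):
    """Estimate a column's distribution from Laplace-noised counts.

    `domain` lists every possible value up front, so that the set of
    reported keys does not itself depend on the private data.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    counts = Counter(values)
    scale = 1.0 / epsilon  # sensitivity 1 divided by the privacy budget
    # Clamping and normalizing are post-processing steps, which do not
    # weaken the differential privacy guarantee.
    noisy = {v: max(counts.get(v, 0) + laplace_noise(scale, rng), 0.0) for v in domain}
    total = sum(noisy.values())
    if total == 0:
        return {v: 1.0 / len(domain) for v in domain}
    return {v: n / total for v, n in noisy.items()}


# Illustrative toy column (not real data): 700 "A" values and 300 "B" values.
rng = random.Random(0)
column = ["A"] * 700 + ["B"] * 300
print(noisy_distribution(column, ["A", "B"], epsilon=0.1, rng=rng))
print(noisy_distribution(column, ["A", "B"], epsilon=10.0, rng=rng))
```

A smaller epsilon means a larger noise scale, so the estimate drifts further from the true 70/30 split; a larger epsilon keeps it closer but gives a weaker privacy guarantee.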
How to use it?
This section guides you through using the Data Synthesizer for Machine Learning operator in a pipeline.
Launch SAP Data Intelligence
Log on to SAP Data Intelligence; the Launchpad opens. Select the “System Management” tile.
Download and Import Data Synthesizer for Machine Learning Operator
Download the operator from here and import it into “System Management”: click “Files”, select “Import Solution” in “My Workspace”, and import the operator.
After the import succeeds, click “Modeler” on the Home page. Select “Operators” on the left panel and search for “data synthesizer” to confirm that the Data Synthesizer operator has been imported successfully.
Build a simple pipeline that uses the Data Synthesizer operator. Select “Graphs” on the left panel of “Modeler” and click “+” to create a new graph.
Select the required operators from “Operators”: “Read File”, “Data Synthesizer”, “Write File” and “Graph Terminator”. Connect the operators in that order.
Set the configuration for each operator.
Click the “Read File” operator and set your own file path under “Configuration” on the right panel.
Note: The default root path is the System Management Files. You can upload your own data there, or upload the data to the data lake.
The “Data Synthesizer” operator has several parameters that you can adjust. All of them have default values, so you can simply keep the defaults.
The parameters are as follows:
- “Epsilon” is the privacy budget in differential privacy, which balances data privacy against utility: a smaller value gives stronger privacy but lower utility. The default value is 0.1.
- “Pseudonym” specifies the columns to synthesize with a pseudonym algorithm. The default value is none.
- “Retain” and “delete” specify the columns to retain or delete in the source data. The default value is none.
- “hasHeader” indicates whether the input CSV file has a header row. If the source CSV has no header, the operator generates new headers for it. The default value is True.
- “Categories” specifies which source columns are categorical. For columns that are not specified, the operator checks this itself. The default value is none.
- “Records” specifies the number of records to generate. The default value is the number of records in the original dataset.
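The parameters and their defaults can be summarized in a small Python sketch. This is a hypothetical mirror of the configuration panel, not the operator's real API: the class name, the field names, and the validation rules (for example, rejecting a column that is listed under both retain and delete) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SynthesizerConfig:
    """Hypothetical container for the operator's parameters and defaults."""
    epsilon: float = 0.1                                  # privacy budget
    pseudonym: List[str] = field(default_factory=list)    # default: none
    retain: List[str] = field(default_factory=list)       # default: none
    delete: List[str] = field(default_factory=list)       # default: none
    has_header: bool = True                               # "hasHeader"
    categories: List[str] = field(default_factory=list)   # default: none
    records: Optional[int] = None                         # None: same as source

    def validate(self, columns, n_source_rows):
        """Check the settings against the source columns.

        Returns the number of records that would be generated.
        """
        if self.epsilon <= 0:
            raise ValueError("epsilon must be positive")
        known = set(columns)
        for name in ("pseudonym", "retain", "delete", "categories"):
            unknown = [c for c in getattr(self, name) if c not in known]
            if unknown:
                raise ValueError(f"{name} refers to unknown columns: {unknown}")
        # Assumption: a column cannot be both retained and deleted.
        overlap = set(self.retain) & set(self.delete)
        if overlap:
            raise ValueError(f"columns both retained and deleted: {sorted(overlap)}")
        if self.records is not None and self.records <= 0:
            raise ValueError("records must be a positive integer")
        return n_source_rows if self.records is None else self.records


# Example: pseudonymize "name", drop "id", and keep every other default.
config = SynthesizerConfig(pseudonym=["name"], delete=["id"])
print(config.validate(["id", "name", "age"], n_source_rows=500))
```

Leaving `records` unset falls back to the size of the source dataset, which matches the default described in the list above.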
Click the “Write File” operator and set the path for the synthesized file under “Configuration” on the right panel.
Click the “run” button to run the pipeline. After the “Status” shows completed, go to the workspace to check the synthesized data and use it in your subsequent work.
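One simple way to check the synthesized data is to compare how often each value of a column occurs in the original and in the synthesized file. The sketch below does this with the total variation distance, where 0 means identical distributions and 1 means completely disjoint ones. It is only one possible check, not part of the operator; the helper names and the tiny inline CSV samples are made up for illustration.

```python
import csv
import io
from collections import Counter


def column_frequencies(csv_text, column):
    """Return the relative frequency of each value in one CSV column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Illustrative samples only; in practice, read the contents of the source
# file and of the file written by the pipeline.
original = "city,age\nBerlin,30\nBerlin,41\nParis,25\nParis,37\n"
synthetic = "city,age\nBerlin,33\nParis,28\nParis,40\nParis,35\n"

distance = total_variation_distance(
    column_frequencies(original, "city"),
    column_frequencies(synthetic, "city"),
)
print(distance)  # 0.25
```

A check like this suits categorical columns; numeric columns would first need to be grouped into ranges before their frequencies can be compared.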
Data Synthesizer for Machine Learning can mitigate users' privacy concerns and make it easier to use data for machine learning.