Customer Engagement Initiative on data scrambling – a success story
This blog post is aimed at data enthusiasts, data architects, and anyone looking to preserve the privacy of data in SAP applications such as SAP S/4HANA running on the SAP HANA database.
In 2019, when customers were still struggling with the data privacy solutions available for application data, an SAP Customer Engagement Initiative project arrived with a promising solution. Below is a review of what this solution provided to multiple customers, presented as a dialog between the "Data Scrambling Project" (DSP) team of the Customer Engagement Initiative and a "Customer". The information comes from several projects, condensed here into one conversation with a single customer.
Data Scrambling Project: Why do you need a solution for scrambling of data in applications?
Customer: Personal and sensitive data is always present in operational systems. Companies often need this data to test new developments or the integration of new software. IT teams create a sandbox or copy of production in the cloud system landscape. Such a copy must be protected with the same authorizations as the production environment until we adopt all necessary technical and organizational measures (including data anonymization where required) to protect any personal information. As owners of SAP applications, we choose to scramble information in such environments in order to protect the privacy of our customers, employees and suppliers, keep access restricted even to the scrambled data, and still enable our teams to test new developments and other uses of the scrambled data as permitted by law.
DSP: How do you install this solution for use?
Customer: The tool is installed in a Docker container on a Linux virtual machine. Installation was completed within minutes by a DevOps specialist. The application exposes a port on the machine so that the user interface can be reached from any browser, and it has built-in HTTPS so that all further communication with other applications in the network stays secure.
DSP: How did you use this solution?
Customer: There is a user interface that accesses the application's SAP HANA database via a JDBC connection. The interface is simple and very similar to the usual ways of accessing an SAP HANA database, such as the cloud cockpit or SAP HANA studio. Once connected to the database, the user interface populates information specific to the application that is to be scrambled.
DSP: What can be achieved with this solution?
Customer: There are three major aspects of our requirements that this solution could meet.
The first is the ability to identify, with good accuracy, the fields in the application that store personal information. The text analysis capability of the SAP HANA platform is used to find entity types that may contain personal information, such as person names, email IDs, IP addresses and others. The solution reports which fields contain which entity types. This gave us good insight into the use and misuse of fields by our business users, and at the same time it helped identify fields that were not anticipated to contain personal information.
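The blog does not show how this detection works internally; as a rough illustration of the idea, here is a minimal sketch that scans a sample of column values against regex patterns for a few entity types. The patterns and function names are illustrative assumptions, not the SAP HANA text analysis API:

```python
import re

# Illustrative patterns only; real text analysis covers far more entity types.
ENTITY_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "PHONE": re.compile(r"\+\d{6,15}\b"),
}

def detect_entity_types(column_values):
    """Return the set of entity types found in a sample of column values."""
    found = set()
    for value in column_values:
        for entity, pattern in ENTITY_PATTERNS.items():
            if pattern.search(str(value)):
                found.add(entity)
    return found
```

Running such a scan per column is what surfaces "misused" fields: a free-text comment column that suddenly reports EMAIL hits is a candidate for scrambling even though its name never suggested personal data.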
The second major requirement met by this solution was to scramble data so that there is zero traceability back to the original field values. Once scrambled, the information in the system retained the structure and format the application needs to run.
The third and most difficult requirement was to keep cross-application business transactions running. Multiple systems with scrambled data must still represent the same customer, now scrambled consistently across systems. Such consistency is expected and is the hardest to achieve, considering that vital information like the user ID is scrambled and present across systems. Without it, the application would not run with scrambled data the way it ran with the original data.
DSP: What do you mean by the application being able to run as it used to?
Customer: Scrambling based on concepts like hashing and cryptography produces data that is not meaningful, and in many cases a name may be replaced with a string longer than the size the application allows for that field. This can lead to application inconsistencies and issues like short dumps in SAP ABAP systems when running with scrambled data. On the other hand, if we use simple methods like pseudonymization with a dictionary, or fixed rules for how data is scrambled, there is always a risk of tracing the scrambled data back to the original.
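To make the tension concrete, here is one way to get both properties at once: a keyed, deterministic replacement that preserves length and character classes (so the application keeps running) while being untraceable without the secret. This is a minimal sketch of the general technique, not the project's actual algorithm; the secret name and helper are assumptions:

```python
import hashlib
import hmac
import string

# Assumption: one secret shared across the landscape so the same original
# value scrambles to the same replacement in every system.
SECRET = b"per-landscape-secret"

def scramble(value: str, secret: bytes = SECRET) -> str:
    """Deterministically replace value with a same-length, same-format string."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]
        if ch.isupper():
            out.append(string.ascii_uppercase[b % 26])
        elif ch.islower():
            out.append(string.ascii_lowercase[b % 26])
        elif ch.isdigit():
            out.append(string.digits[b % 10])
        else:
            out.append(ch)  # keep separators so the field format stays intact
    return "".join(out)
```

Because the mapping is keyed rather than rule-based or dictionary-based, knowing the output alone does not let anyone walk back to the original; because it is deterministic, the same user ID scrambles identically in every connected system.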
The solution provided by this SAP Customer Engagement Initiative project met both the application requirements and the scrambling requirements relevant for compliance.
DSP: What were the main challenges faced during the project?
Customer: There were many challenges in creating the solution. Some of them are outlined below, and two are detailed further:
- finding misuse of fields to store personal information
- identifying business logic and boundary conditions so that only relevant data is scrambled
- scrambling figures such as employee salaries
- generating unique identifiers that are not the originals but still satisfy the application's validation rules
- scrambling country-specific information
- replacing entities like IBANs with generated IBANs that meet international rules
- retaining consistency for cross-application business processes
- matching consistency of key fields (explained in detail below)
- performance when scrambling large tables, and short overall run-times
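The IBAN point is a good example of "not original, but still valid": a generated IBAN must pass the international mod-97 check or the application will reject it. As a hedged sketch (the project's actual generator is not described in the post; the German-format assumption and function names are mine):

```python
import random
import string

def iban_check_digits(country: str, bban: str) -> str:
    # Per the ISO 13616 scheme: append country code and "00", map letters
    # to numbers (A=10 .. Z=35), then take 98 minus the value mod 97.
    rearranged = bban + country + "00"
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return f"{98 - int(digits) % 97:02d}"

def generate_fake_iban(country: str = "DE", bban_length: int = 18) -> str:
    """Generate a random but checksum-valid IBAN (German 22-char format assumed)."""
    bban = "".join(random.choices(string.digits, k=bban_length))
    return country + iban_check_digits(country, bban) + bban

def is_valid_iban(iban: str) -> bool:
    # Move the first four characters to the end; a valid IBAN yields mod 97 == 1.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(digits) % 97 == 1
```

The same pattern applies to other validated identifiers in the list: generate random content, then compute whatever check digits or structure the validation rule demands.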
One of the major challenges was the requirement to scramble the user IDs of our employees, contractors and customers. These user IDs exist in thousands of tables, many of which are linked to each other through primary key–foreign key relationships. It was therefore a must to scramble all user IDs consistently across the whole database. The problem was amplified by the numerous customer enhancements in the system, with many Z-tables and Z-fields. If not handled appropriately, the system would produce inconsistent results when running transactions, or even when doing analytics on the scrambled data.
The second major challenge was performance, as some tables held billions of records containing personal information. Replacements at such a massive scale cannot be done in a short span of time, which leads to long delays between the moment a system copy is created and the moment it becomes usable after data scrambling.
DSP: How were these problems resolved by the solution from SAP?
Customer: The team worked hard, with a forward-looking vision, to solve these problems in a timely manner through changes to the architecture and the overall approach of the solution. Challenges like key fields were addressed by building strong automated checks to identify the presence of primary keys such as user IDs in tables across the application. This helped retain application consistency while scrambling a primary key like the user ID.
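One simple form such an automated check can take is a catalog scan for columns whose names suggest they hold user IDs, including customer Z-fields. This is a sketch under my own assumptions: the field-name list draws on common ABAP conventions (BNAME, UNAME, ERNAM, AENAM), and a real check would also follow foreign-key metadata rather than rely on naming alone:

```python
# Assumed naming conventions for user-ID fields in SAP ABAP systems.
USER_ID_FIELDS = {"USER_ID", "UNAME", "BNAME", "ERNAM", "AENAM"}

def find_user_id_columns(catalog):
    """Scan (table, column) pairs for columns that likely store user IDs.

    `catalog` stands in for the database catalog, e.g. the table/column
    listing that SAP HANA exposes through its system views.
    """
    suspects = []
    for table, column in catalog:
        name = column.upper()
        if name in USER_ID_FIELDS or name.endswith("_USER_ID"):
            suspects.append((table, column))
    return suspects
```

Every column the scan flags is then scrambled with the same deterministic mapping, which is what keeps primary key–foreign key links intact across thousands of tables.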
To manage the performance issues, the solution introduced several technical measures within the first cycle of the project itself: parallelizing the operation, partitioning the tables while they are being scrambled (without changing the original tables), managing the scrambling runs with multi-threading, and many other improvements. SAP's forward thinking helped the project run smoothly and complete within a reasonable duration while meeting the desired results.
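The partition-and-parallelize pattern described above can be sketched in a few lines. The partition count, worker count, and the placeholder transform are all assumptions for illustration; the real solution reads, scrambles and writes back each partition inside the database:

```python
from concurrent.futures import ThreadPoolExecutor

def scramble_partition(rows):
    # Placeholder transform; in practice each worker would apply the
    # deterministic scrambling to its own slice of the table.
    return [value[::-1] for value in rows]

def scramble_table(rows, partitions=4, workers=4):
    """Split a table into partitions and scramble them in parallel."""
    size = max(1, -(-len(rows) // partitions))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(scramble_partition, chunks)  # order-preserving
    return [value for chunk in results for value in chunk]
```

Because the partitions are independent, run-time on a billion-row table scales down with the number of workers the database host can sustain, which is what shortens the window between copying a system and being able to use it.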
DSP: Can you give the scale of scrambling performed?
Customer: It differs across the landscape, as different systems hold different datasets and different volumes of personal data. To give one example, in one of the systems the solution scrambled more than 50 billion values in an SAP HANA database, covering personal information in over 10,000 tables and 25,000 fields. Four different applications were scrambled in one hybrid landscape, and cross-application consistency was verified.
DSP: How would you summarise the overall experience of this engagement with SAP?
Customer: This engagement for scrambling data in our systems based on the SAP HANA database, such as SAP S/4HANA, SAP MDG and others, was a great experience. It can be summarised in three aspects: compliance, increased utility, and reduced time to implement.
System anonymization, together with other safeguards, is an important cornerstone of data protection and privacy compliance and of the proper processing of personal data. This refers not only to the GDPR but to many more data privacy laws around the world.
Finally, we move towards higher compliance and also enable our Information Technology colleagues to do their daily jobs in a compliant and safe environment. This certainly reduces the risk of fines, while increased data utility, fewer product defects, reduced time for integration tests, and better data utility for machine learning stand out as the main benefits.
Previously implemented solutions took much longer to put in place, and the flexibility of this solution to adapt to compliance and data security audits was phenomenal. From an operations perspective, the solution is easier to handle (e.g. setting new anonymization parameters) and can be operated by a smaller team than the other solutions required.
As a follow-up to this blog post, you can read more about this project and SAP's future plans on the Customer Engagement Initiative page: https://influence.sap.com/sap/ino/#/campaign/2033. Feel free to leave a comment and follow this post to stay updated with comments from other readers.
The SAP Customer Engagement Initiative connects SAP development teams with our customers and partners early and regularly. The initiative fosters and enables close interaction between SAP and our customers and partners by providing a structured approach and a legal framework. You can also find more information about SAP Customer Experience on our topic page and post your questions here.