Creating Secure Test Data
a weblog, by Mark S. Ciminello, MBA, PMP, CISSP, CCSP
updated March 5, 2020
Mark Ciminello is SAP’s Principal Engineer for Cloud Security and is SAP’s lead resource in the domains of Cloud (hosting), InfoSec, Data Privacy, Legal and Regulatory Compliance within the presales organization. He lives in the Phoenix area and has been with SAP since 2010, bringing more than 35 years of experience in Project Management, Information Security and Cloud.
As a security professional, I often get asked, “what are the best practices for establishing good, clean test data?”
As a former developer with 30 years of experience, my answer, and the standard practice I used, was to simply copy over data from the production environment. That way, the data set would be good, it would be complete, and it would be real data. It’s a common practice, and many enterprises use the technique. Mocked-up data, often referred to as dummy data, is seldom if ever adequate. It is extremely time consuming and tedious to create dummy data in a standalone environment, and it rarely offers the level of accuracy and integrity required to sufficiently test an environment.
This seems rational to me. By starting with real data from a live environment, you avoid entering dummy data, like “John Doe” as your customer, living at “123 Maple St.” There is, without question, inherent value in having a full and complete data set, and a simple environment copy usually does the trick.
Think about it. A typical enterprise has hundreds or thousands of users, perhaps even tens of thousands, generating data throughout the day and through the years. You can’t possibly duplicate that in a mocked-up environment. So the practice of copying data sets from environment to environment makes perfect sense.
Over time, we would clear or delete the test environment, and refresh the data, copying over current data to the test environment and starting the process of testing all over again. In this manner, we always had good, clean data, and that generally eliminated the need to manually or programmatically create mock or dummy data.
The end goal is easier to achieve: a complete and comprehensive data set for any non-production environment for testing, training or quality assurance purposes.
Risking Exposure in Non-production Environments
Yet, that very practice carries with it a significant risk of exposing sensitive data.
Today, almost daily, there are companies in the headlines that have accidentally or inadvertently exposed the personal data of individuals. The breach list started small years ago, but has grown to the point at which data breaches, sadly, are becoming the norm for every enterprise.
I have a stack of breach notification letters to prove it.
Attackers Choose the Path of Least Resistance
Attackers, those who have stolen the data, will generally take the path of least resistance. They will likely take the easy way in if that is an option to them. While many enterprises are concerned about firewalls, encryption and the like, they often overlook the simple ways to steal their data, like social engineering and phishing scams, and they fail to properly secure accounts and passwords.
If I, as an attacker, could somehow gain access to a user account and password, I would have quick and easy access, and wouldn’t have to attempt very advanced, complex hacking techniques to get in. Make no mistake. Getting in with a valid account and password is much easier than trying to hack into an encrypted session. With a user account and password, once you know the URL, you simply sign on.
Testing Security in Non-production Environments
Now, by copying over data to a non-production environment, one of the things that needs to be tested is the security of that data. After all, you would normally want to test the security, as well as the functionality of the system. If you set up data permissions in your application, you’ll need to test them while you test the functions of the application. It is normal and expected that you test everything, or most certainly, conduct tests at a level sufficient to show that you have been reasonable and prudent in your efforts.
Attackers know this, and actively seek non-production environments as they represent a path of least resistance. If you let down your guard, lower your force fields, or just drop your shields, you provide attackers with easy access. That makes non-production environments an easy target.
Arguably, if you copy over security, user accounts and have a comprehensive set of security practices in place, identical to a production environment, then there should not be an issue, in theory. But of course, now you have two environments, both of which are duplicated to a degree, and therefore twice as much to be concerned with.
But the problem is that many enterprises, as a bad practice, do not copy over the full security measures already established in the production environment. Most will not duplicate their security, and instead manually enter a select few accounts with access to the non-production environment. Further, the roles and permissions are typically not copied, so those very same users now have unfettered access to all the data. The assumption, albeit severely flawed, is that it is not a live environment and therefore we don’t need that level of security.
That’s a big mistake. A very big one.
In fact, I’ve seen a number of my customers put no permissions in place, and leave the environment wide open. “We’ll test later, and add in security then” is the response. Wow. That’s a really bad practice. It’s like accessing the internet without a firewall or anti-malware solution in place.
Protecting Data in Non-Production Environments
As an SAP Enterprise Cloud Architect, I am all too aware of these practices. It’s very problematic, it’s a shortcut to setting up test environments, and a really bad practice from a security perspective. Really bad.
SAP, as an organization, is also very much aware of this, and likely has significant experience with SAP customers who make this sort of mistake. This potentially exposes SAP to risk as well, especially in SAP Cloud solutions, where SAP is the host. SAP is obligated to protect customer data while in its cloud, as well as customer personal data. SAP extends our best security practices to do so. But SAP loses all control when customers set up systems with little or no security. So arguably, SAP should not be held accountable for breaches caused by customers who make the mistake of not implementing appropriate security on their end.
I’m a little surprised that I actually have to say this aloud, because it seems fairly obvious to me, but security and protection of data is not just SAP’s responsibility, it is a shared responsibility. Both SAP and its customers must establish and implement security practices, and protect the enterprise and personal data in all environments.
For cloud solutions, knowing that customers frequently do this in practice, SAP makes it known that it will not accept responsibility for protecting data in a non-production environment. It’s not that SAP’s hosting practices change or security measures vary by environment; it is more that customers may not act responsibly with their security practices, and so SAP does not want to assume any risks of exposure or the liabilities and warranties associated therewith.
As a security professional, that makes complete sense to me. If you continue the practice of copying over production data and fail to secure it properly, you must accept full responsibility for doing so.
How to Protect Non-production Environments
So how then should we set up good test data in non-production environments without adding risk of exposing live data? We need good data, true, and so copying the data makes the most sense.
I don’t argue this point at all. My argument, from a security perspective, is that the best practice is to make sure relevant, reasonable and practical security measures are put in place to secure your test data. I also argue for not putting any data in a non-production environment that is classified as anything more than public data. That way, if it does get exposed, it causes no harm. That’s fairly important, and the driver and rationale behind this discussion.
Simply put, you should never put anything of value into a non-production environment, by policy. If you can make that statement with confidence, then when your non-production environment does get breached, it creates no issues, because the content of the environment is of no value.
So that’s my first recommendation. Don’t put anything of value into a non-production environment.
How do you do that? Tokenization of the data is a good and highly recommended practice.
Introducing Tokenization as a Method to Secure Data
The simplest solution I see is to copy over production data, as it has significant value for testing and QA purposes, but to tokenize that data.
Tokenization, by definition, is the process of substituting sensitive data (or really, any data classified at a higher level than public data) with a non-sensitive value, called the token, where the new data element has no value and carries with it no risk of exploitation. The token can be a randomized value (that is, a dummy value), but as a better method of creating test data, a token should be a substituted and tracked value. The problem with using dummy data is that while a data element takes one value in a single transaction or master data record, it could potentially take a different value in a second transaction. Random substitution is easy to do with simple math, but it does not preserve the integrity of the data.
According to Wikipedia, “Tokenization is often used in credit card processing. The PCI Council [Payment Card Industry] defines tokenization as ‘a process by which the primary account number (PAN) is replaced with a surrogate value called a token. De-tokenization is the reverse process of redeeming a token for its associated PAN value.’ ”
Consider your taxpayer ID, which in the US is your Social Security number (SSN). Your SSN is not used very often. In fact, it should only be used for tax reporting purposes, typically within a payroll function. But it is often seen as the highest value target of personal data, and enterprises understand clearly the need to protect your SSN from exposure. You should not be using SSNs unless there is a function that requires it. But let’s continue on for illustrative purposes.
In a tokenization system, as you copy over data elements from production to non-production, you substitute the real value with a token, and create a token record in a token database to track the new value.
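The substitute-and-record step described above can be sketched as follows. This is a minimal illustration, not a production design: the table name `token_record`, its columns, the `TOK-` prefix and the in-memory SQLite database are all assumptions made for the example.

```python
import secrets
import sqlite3


def open_token_db() -> sqlite3.Connection:
    """Create an in-memory token database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE token_record (
            field_name TEXT NOT NULL,
            real_value TEXT NOT NULL,
            token      TEXT NOT NULL,
            PRIMARY KEY (field_name, real_value),
            UNIQUE (field_name, token)
        )
        """
    )
    return conn


def tokenize(conn: sqlite3.Connection, field_name: str, real_value: str) -> str:
    """Return the tracked token for a real value, creating a record if needed."""
    row = conn.execute(
        "SELECT token FROM token_record WHERE field_name = ? AND real_value = ?",
        (field_name, real_value),
    ).fetchone()
    if row is not None:
        return row[0]  # already tokenized: reuse the tracked value

    # A random token has no mathematical relationship to the real value.
    token = "TOK-" + secrets.token_hex(8)
    conn.execute(
        "INSERT INTO token_record (field_name, real_value, token) VALUES (?, ?, ?)",
        (field_name, real_value, token),
    )
    conn.commit()
    return token
```

Calling `tokenize(conn, "ssn", "587-01-1234")` twice returns the same token both times, because the second call finds the existing token record instead of creating a new one.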
Let’s first consider your Social Security Number as an example. How Stuff Works has a great explanation of how SSNs are structured, summarized as follows.
The nine-digit SSN, which has been issued in more than 400 million different sequences, is divided into three parts.
- Area numbers – The first three numbers originally represented the state or area in which a person first applied for a social security card. Numbers started in the northeast and moved westward.
- Group numbers – The two middle digits, which range from 01 through 99, are simply used to break all the SSNs with the same area number into smaller blocks to make administration easier. Group numbers issued first consist of the odd numbers from 01 through 09, and then even numbers from 10 through 98, within each area number assigned to a state. The even numbers from 02 through 08 and the odd numbers from 11 through 99 follow later in the sequence. The group number 00 is never valid.
- Serial numbers – Within each group designation, serial numbers — the last four digits in an SSN — run consecutively from 0001 through 9999.
If your SSN was issued in California before 2011, for example, it likely starts with a number in the ranges 545-573 or 602-626. The list of areas is readily found on the web, by the way. Keep in mind that the Social Security Administration switched to randomized assignment in June 2011, so area numbers on newer SSNs no longer indicate geography, and gaps in the old area list can no longer be assumed to be unused. Area numbers 000, 666 and 900-999 are never assigned to an SSN, however, so that gives some freedom and flexibility to use those numbers as substitution variables.
So to create a new token value for the SSN, translate the number from a real value to a fake value, such as area number 587 to 987 and group number 01 to 00. If your SSN is 587-01-1234, then your token value could be 987-00-1234.
What that does, in effect, is to translate the number to an invalid or mock SSN, by changing the Area Number and the Group Number to invalid numbers. As long as the algorithm is protected and not easily guessed, you will translate valid and real SSNs to fake ones. Therefore, if the data did get exposed, not only would the SSN be invalid, it could not be tied back to the individual person, and thus the risk of exposure is significantly reduced, if not altogether eliminated.
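As a rough sketch of translating real SSNs into mock ones, the code below builds tokens only from values the Social Security Administration does not assign: area numbers 900-999 combined with group number 00. That choice is an assumption of this example, as are the names `split_ssn`, `could_be_real_ssn` and `SsnTokenizer`. Unlike the worked example in the text, the sketch draws the token at random and tracks it, so nothing about the real number, including its last four digits, carries over.

```python
import re
import secrets

SSN_PATTERN = re.compile(r"^(\d{3})-(\d{2})-(\d{4})$")


def split_ssn(ssn):
    """Split 'NNN-NN-NNNN' into (area, group, serial) strings."""
    match = SSN_PATTERN.match(ssn)
    if match is None:
        raise ValueError(f"not in NNN-NN-NNNN format: {ssn!r}")
    return match.group(1), match.group(2), match.group(3)


def could_be_real_ssn(ssn):
    """False when the number uses values the SSA never assigns."""
    area, group, serial = split_ssn(ssn)
    if area in ("000", "666") or area >= "900":
        return False
    return group != "00" and serial != "0000"


class SsnTokenizer:
    """Maps each real SSN to one tracked token of the form 9NN-00-NNNN."""

    CAPACITY = 100 * 10_000  # 100 area numbers x 10,000 serial numbers

    def __init__(self):
        self._token_by_real = {}
        self._issued = set()

    def tokenize(self, real_ssn):
        split_ssn(real_ssn)  # validate the input format
        if real_ssn in self._token_by_real:
            return self._token_by_real[real_ssn]
        if len(self._issued) >= self.CAPACITY:
            raise RuntimeError("token space exhausted")
        while True:
            area = 900 + secrets.randbelow(100)
            serial = secrets.randbelow(10_000)
            token = f"{area:03d}-00-{serial:04d}"
            if token not in self._issued:
                break
        self._issued.add(token)
        self._token_by_real[real_ssn] = token
        return token
```

Because the token is random rather than computed from the real number, there is no translation formula for an attacker to guess; the trade-off is that the mapping itself has to be stored and protected.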
Within the context of security, tokenization is an obfuscation technique, where obfuscation is the simple act of hiding the real value and making it difficult to read the correct value. The masking of passwords during data entry is another example of an obfuscation technique.
Wikipedia has a great example of how tokenization is used for credit card processing, using a mobile app.
Creating a Tokenization Data Source
A key part of tokenization is not just the substitution of the value, but rather, the tracking of the new value. It’s important to data integrity to always translate the real value to the tokenized value. It’s not enough to just create a fake SSN. In every transaction in which SSN could exist, it must be translated from the real value to the new value in order to preserve data integrity.
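To illustrate why consistent translation matters, the sketch below tokenizes the SSN in two related record sets with the same translator and checks that the records still match up afterwards. The record layouts, field names and the `TOK-` prefix are made up for the example.

```python
import secrets


def make_translator():
    """Return a function that maps each real value to one stable token."""
    mapping = {}

    def translate(real_value):
        if real_value not in mapping:
            mapping[real_value] = "TOK-" + secrets.token_hex(6)
        return mapping[real_value]

    return translate


def tokenize_rows(rows, field, translate):
    """Copy the rows, replacing `field` with its token in every row."""
    return [{**row, field: translate(row[field])} for row in rows]


def join_on(left, right, field):
    """Pair up rows from two record sets that share a value in `field`."""
    return [(l, r) for l in left for r in right if l[field] == r[field]]


# Hypothetical master data and transactions that share an SSN.
employees = [
    {"ssn": "587-01-1234", "employee_id": 1},
    {"ssn": "587-03-5678", "employee_id": 2},
]
payroll = [
    {"ssn": "587-01-1234", "period": "2020-01", "gross": 4200},
    {"ssn": "587-01-1234", "period": "2020-02", "gross": 4200},
    {"ssn": "587-03-5678", "period": "2020-01", "gross": 3900},
]

# One translator is shared by every record set that contains the field.
translate = make_translator()
test_employees = tokenize_rows(employees, "ssn", translate)
test_payroll = tokenize_rows(payroll, "ssn", translate)
```

The tokenized payroll rows still join to the tokenized employee rows exactly as the originals did. Had each record set been given independent random values, the join would come back empty.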
Additionally, the end user must look up their “new” SSN in the test environment, to note what that new value is, and to leverage that in testing the system for functionality. In other words, my new SSN becomes the token value from the above example instead of the real, true value.
What you need then, is a token server, a database of stored token values that can be used to translate data elements from real to fake values. The Token Server is typically a database, on premise or in the cloud, that is used to contain real, sensitive data, but offers data translation to token values. Needless to say, the Token Server must be highly secure, and the algorithm that translates values must neither be guessed, nor exposed. Otherwise, just like weak encryption or passwords, it could be reverse engineered to expose the real value.
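A token server's core behavior might look something like the sketch below. It is an in-memory stand-in, and the class name `TokenServer`, the `authorized_callers` parameter and the caller names are all hypothetical. It also swaps a translation algorithm for randomly generated tokens, so the stored mapping is the only link back to the real value and there is no formula to reverse engineer.

```python
import secrets


class TokenServer:
    """In-memory stand-in for a token server (illustrative only)."""

    def __init__(self, authorized_callers):
        # Only these callers may translate a token back to the real value.
        self._authorized = frozenset(authorized_callers)
        self._token_by_real = {}
        self._real_by_token = {}

    def tokenize(self, real_value):
        """Return the tracked token for a real value, creating one if needed."""
        if real_value in self._token_by_real:
            return self._token_by_real[real_value]
        token = secrets.token_urlsafe(12)
        while token in self._real_by_token:  # retry on the unlikely collision
            token = secrets.token_urlsafe(12)
        self._token_by_real[real_value] = token
        self._real_by_token[token] = real_value
        return token

    def detokenize(self, token, caller):
        """Redeem a token for its real value, for authorized callers only."""
        if caller not in self._authorized:
            raise PermissionError(f"{caller!r} may not detokenize")
        return self._real_by_token[token]
```

A real deployment would also need authentication, encrypted storage and audit logging around these two calls, all of which are outside the scope of this sketch.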
Leveraging SAP Cloud Platform for Integrations and Tokenization
SAP Cloud Platform is typically used for integration services. When leveraging SAP Cloud solutions, Cloud Platform (CP) Integration Services is used to manage the integrations. CP’s Integration Services also has the capability of introducing logic and math, as well as interacting with a Token Server to translate data elements.
That’s good news for SAP customers who choose to use CP to manage the integrations.
So while it’s easy to simply copy over or clone your production data in one fell swoop, from a security perspective it introduces considerable risk. SAP customers are welcome to use this approach, but it is discouraged, and as previously pointed out, SAP will not assume liability for breach exposure even though SAP generally offers comparable security of non-production environments.
What customers can and should do, is leverage Cloud Platform’s Integration Services and introduce a tokenization process in line with data integrations, in order to tokenize and secure the data properly. This is a great example of a business use case for Cloud Platform, rather than using any capability to copy over the entire environment. Yes, it does mean you have to do some extra work, but you must balance the amount of effort against the risk and the liability of exposure.
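Conceptually, a tokenization step placed in line with an integration looks like the sketch below. It is plain Python and does not use any SAP Cloud Platform API. The field names in `SENSITIVE_FIELDS`, the `FieldTokenizer` class and the record layout are assumptions chosen for illustration.

```python
import secrets

# Hypothetical list of fields classified above "public".
SENSITIVE_FIELDS = frozenset({"ssn", "bank_account"})


class FieldTokenizer:
    """Tracks one stable token per (field, real value) pair."""

    def __init__(self):
        self._tokens = {}

    def __call__(self, field_name, real_value):
        key = (field_name, real_value)
        if key not in self._tokens:
            self._tokens[key] = f"{field_name}-{secrets.token_hex(6)}"
        return self._tokens[key]


def tokenize_stream(records, tokenizer, sensitive_fields=SENSITIVE_FIELDS):
    """Yield a copy of each record with its sensitive fields tokenized.

    Non-sensitive fields and empty (None) values pass through unchanged.
    """
    for record in records:
        yield {
            name: (
                tokenizer(name, value)
                if name in sensitive_fields and value is not None
                else value
            )
            for name, value in record.items()
        }
```

Records read from production would pass through `tokenize_stream` before being written to the non-production target, so the real values never land there.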
So what I’ve done is explain the risks of exposure of sensitive data, and how it can be reduced or eliminated. I have argued against the popular practice of simply copying data from a production to a non-production environment. Attackers will always take the path of least resistance, so let’s not make it easy for them.
The practice of copying entire environments from one to another should cease. This is not prudent and introduces considerable risk of exposure to sensitive data.
The simplest solution I see is to copy over production data but tokenize that data, which is the process of substituting sensitive data with a non-sensitive value. This preserves the integrity of data but obfuscates the true value of the data, providing the mutual benefits of data integrity and accuracy while safely securing the data.
A token server can be used to manage and track the translation of the token values, but it too must be highly secured. This is an additional security consideration for the tokenization technique.
SAP’s Cloud Platform, typically used for integration services, has within it the capability to create and manage a token database, and to introduce tokenization algorithms into the integrations.
It’s a little more work than a simple environment copy, but it does achieve the goal of putting no data into your non-production environment that has value.
The content presented herein represents the opinion of the author and does not necessarily represent the formal policy or practices of SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP has no obligation to pursue any course of business outlined in this presentation or any related document, or to develop or release any functionality mentioned therein.
This presentation, or any related document and SAP’s strategy and possible future developments, products and or platforms directions and functionality are all subject to change and may be changed by SAP at any time for any reason without notice. The information in this presentation is not a commitment, promise or legal obligation to deliver any material, code or functionality. This presentation is provided without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This presentation is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or omissions in this presentation, except if such damages were caused by SAP’s intentional or gross negligence.
All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as of their dates, and they should not be relied upon in making purchasing decisions.