Multicloud Security Compliance Scanning: Complianc...

JayThvV · 04-17-2022

Running a Home-Grown Security Compliance Solution

In a previous blog from about a year ago, I spoke about our move towards compliance-as-code for security compliance scanning in SAP. Since then, the solution has evolved into a policy control development, scanning and data pipeline operation that enables compliance, remediation and enforcement efforts through the organizational hierarchy. Multicloud Security Compliance Scanning now performs the entire Cloud Security Posture Management function of scanning resources in all of SAP's nearly 12,000 cloud accounts for compliance with SAP Global Security policies and hardening procedures across AWS, Azure, GCP, Alibaba, AWS China and Azure China. Each scan run returns around 8 million results, split between "passed" (meets compliance, the vast majority), "failed" (does not meet compliance), and "skipped" (resources for which an established and approved exception exists).

In this blog, I'd like to list some of the operational benefits we gained from taking this function in-house, in a solution based on the Chef InSpec engine. Because my team developed everything in the solution around this engine, the technical aspects and the operational processes in the organization that were either strengthened or developed around it often bleed into each other seamlessly. Our team is called SecDevOps, and the solution is a manifestation of that name: we develop it, deploy it and operate it.

The diagram below gives an overview of the Multicloud Security Compliance Scanning solution.

[Diagram: Multicloud Security Compliance Scanning]

As we will see, the operational benefits are both technical and organizational.

Covering all the Controls in the Policy

SAP Global Security defines our policies, based on industry benchmarks, certification requirements, contractual obligations and good security practices. These are often close to, but don't necessarily fully map to, the default controls built into commercial CSPM SaaS services. These services allow you to customize the controls up to a point, but how far depends on their rules engine. The tool may also simply not have the capability to support a particular control.

This is compounded by SAP running on 4, arguably 6, cloud providers, where the control coverage of a commercial tool is not necessarily the same for each platform. For instance, we now have a number of controls on all platforms we need to support, instead of only on those for which the commercial tool supported the control.

Since the Chef InSpec controls are written in Ruby, that is, an actual programming language, and the Chef InSpec resource packs are open source, we can write the controls to the policy and therefore provide full coverage of the policy. To give a sense of scale, when we switched from the previous service to the current solution, we doubled the number of controls scanned among the high severity controls. That is, roughly half of the policy previously could not be verified. Now, we can be confident that all of the policy is being scanned for.

Clarity in the Code, Clarity in the Policy

The theory of compliance controls-as-code is good. By implementing the control, the pass and fail conditions should be clear, and therefore the control should be less ambiguous. The practice is even better, because what we find is that it becomes a two-way street.

Below you see a diagram of the release process. The SecDevOps team develops the controls to the policies defined by the SAP Security Defensive Architecture team, who review the code and sign off that it meets the policy.

[Diagram: Multicloud Security Compliance Scanning Release Process]

As we develop or update controls, we run into ambiguities in the policy text, or practical obstacles in how the policy intent can be expressed in code or verified via the cloud provider's API. Through the very effort of translating the policy into code, we validate not only whether the control meets the policy... but also whether the policy needs to be adjusted to the control code.

By matching policy to code and code to policy, the security policy itself becomes more practical and clear.

Clarity in Operations

This clarity in the policy and the control-as-code dramatically speeds up operations. Previously, it was never quite clear how the control in a commercial tool scanned the resource, as that tends to be proprietary. So, what exactly is the fail condition? When an internal team questioned a finding, it often took several weeks to find out.

This confusion has been eliminated altogether. When a business unit challenges a finding, we can pull up the control source code and the policy, and match them with the configuration of the resource. If the resource doesn't match the control-as-code and policy expectations, the resource needs to be remediated or an exception requested. If the control and policy don't match, the control code needs to be adjusted. If they do match and the finding is still in dispute, the control may not do what it intends to do and requires investigation.

But it can also clarify quickly: Is the business unit arguing the control code... or the policy? The former is a technical discussion and investigation with the SecDevOps team. The latter is a discussion with the SAP Global Security team. So it accelerates triage in day-to-day operations and shows where to redirect the business unit for follow-up.
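The triage questions above can be written down as a small decision function. This is only a sketch in Python (the real controls are Ruby), and the names `Challenge`, `Route` and `triage` are hypothetical, not part of our tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    POLICY_DISCUSSION = "discussion with SAP Global Security"
    FIX_CONTROL = "SecDevOps adjusts the control code"
    REMEDIATE_OR_EXCEPTION = "business unit remediates or requests an exception"
    INVESTIGATE_CONTROL = "SecDevOps investigates the control's behaviour"


@dataclass(frozen=True)
class Challenge:
    """Hypothetical record of what was established when a finding is challenged."""
    disputes_policy: bool         # the business unit disagrees with the policy itself
    control_matches_policy: bool  # the control code expresses the policy text
    resource_meets_policy: bool   # the resource configuration satisfies the policy


def triage(challenge: Challenge) -> Route:
    """Decide where a challenged finding should go next."""
    if challenge.disputes_policy:
        return Route.POLICY_DISCUSSION
    if not challenge.control_matches_policy:
        return Route.FIX_CONTROL
    if not challenge.resource_meets_policy:
        return Route.REMEDIATE_OR_EXCEPTION
    # Control and policy agree and the resource looks compliant, yet it failed:
    # the control is not doing what it intends to do.
    return Route.INVESTIGATE_CONTROL
```

For example, `triage(Challenge(False, True, False))` routes the finding back to the business unit for remediation or an exception request.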

Consistency between Development and CI/CD tooling and Post-Deployment Scans

From the start, we made a 'consumer container' available to our internal developer teams. This allows teams to run compliance scans against their own cloud accounts whenever they want, independently of the central scans. It can be integrated in CI/CD pipelines and testing cycles ("shift-left"), or used in remediation efforts to confirm that a remediation indeed fixed a particular violation we caught in the central scans ("shield-right").

The consumer container contains exactly the same ruleset as the central scans, so there is complete consistency between developer tooling and the post-deployment compliance scans.
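As an illustration of the shift-left use, a pipeline step could parse a scan report and block a deployment on failed high severity controls. The sketch below assumes the report follows the general layout of Chef InSpec's JSON reporter (profiles containing controls, each with an impact and a list of results carrying a status). The control ids, the 0.7 threshold and the function name are assumptions made for this example, not the consumer container's actual interface.

```python
import json


def failed_controls(report_json: str, min_impact: float = 0.7) -> list[str]:
    """Return ids of controls at or above min_impact with at least one failed result."""
    report = json.loads(report_json)
    failed = []
    for profile in report.get("profiles", []):
        for control in profile.get("controls", []):
            if control.get("impact", 0.0) < min_impact:
                continue  # below the severity we gate on
            statuses = {result.get("status") for result in control.get("results", [])}
            if "failed" in statuses:
                failed.append(control["id"])
    return sorted(failed)


# Hypothetical report with made-up control ids, for illustration only.
SAMPLE_REPORT = json.dumps({
    "profiles": [{
        "controls": [
            {"id": "example-storage-encryption", "impact": 1.0,
             "results": [{"status": "passed"}, {"status": "failed"}]},
            {"id": "example-logging-enabled", "impact": 0.7,
             "results": [{"status": "passed"}]},
            {"id": "example-tagging", "impact": 0.3,
             "results": [{"status": "failed"}]},
        ]
    }]
})
```

A pipeline step would then exit non-zero whenever `failed_controls(...)` returns a non-empty list. Because the ruleset is the same as in the central scans, a control that passes here should also pass there.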

Remediations included in the Control Code

Our previous commercial CSPM solution contained remediation recommendations for the alerts it generated. The community often relied on those in past years, and we knew that we had to develop remediation text: if we tell teams they did something wrong, we should expect that they'll ask for advice on how to correct it.

We want to encourage an 'infrastructure-as-code' approach, so the core of our remediations is the set of Terraform fragments we use during the control development cycle to generate the known-good resources that should pass our security controls. In addition, since not all business units in the company can be expected to remediate only via IaC, we research cloud provider documentation to write clear recommendations for corrections via the cloud provider admin console, and test them to make sure they meet the pass condition of the controls.

The text of the remediations (console and Terraform) is included in the source code of the controls themselves, and is therefore part of the SGS code review. The remediations are also written to the controls and therefore specifically indicate how to satisfy the policy control. We then use automation to publish Control Details documentation that teams in the company can access, and that we can reference in security operations and interactions with the business unit teams.
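To make the idea concrete, here is a minimal sketch of keeping remediation text next to a control definition and rendering documentation from it. It is in Python for illustration only. The `ControlDoc` structure, the control id and the Terraform fragment are hypothetical placeholders rather than our actual control metadata or a real provider resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlDoc:
    """Hypothetical remediation metadata stored alongside a control."""
    control_id: str
    title: str
    console_remediation: str
    terraform_remediation: str


def render_control_details(doc: ControlDoc) -> str:
    """Render a plain-text Control Details page from the control's own metadata."""
    lines = [
        f"{doc.control_id}: {doc.title}",
        "",
        "Remediation (admin console)",
        doc.console_remediation.strip(),
        "",
        "Remediation (Terraform)",
        doc.terraform_remediation.strip(),
    ]
    return "\n".join(lines) + "\n"


# Placeholder content: not a real control, and not real provider syntax.
EXAMPLE = ControlDoc(
    control_id="example-storage-01",
    title="Storage must not be publicly accessible",
    console_remediation="Open the storage settings and disable public access.",
    terraform_remediation='resource "example_storage" "this" {\n  public_access = false\n}',
)
```

Because the documentation is generated from the same reviewed source as the control, the published remediation cannot drift away from what the control actually checks.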

Clear Evidence Trail for Independent Audits

We run our compliance scans on all cloud accounts in the company, whether for internal or external workloads. The majority of the landscape runs customer-facing workloads, all subject to certification audits of at least ISO and SOC 2, and PCI-DSS where that applies. Such certifications are issued after third-party audits, and these include requirements for security compliance scanning.

There is an entire evidence trail for such audits, from the policy and policy updates to the development and update of the security controls. Since all code and documentation is in our internal Git repository and reviews are done via pull requests, the entire process is clearly documented and can be reviewed by the auditors in as much depth as they might want. Here too, the clear lineage and relationship between the policy and the compliance control-as-code removes ambiguity and simplifies the audit process.

Engagement and Validation with the Internal User Community

Since August 2019, we have been running a weekly Multicloud Security Office Hours call with the internal community around security in public cloud landscapes. This is an hour-long call each Tuesday at a Europe- and Asia-friendly time, with a US West Coast and Asia version later in the day, so we can reach teams wherever they are. In this call we give a weekly status update on our security compliance posture, and provide access to the SGS policy team and the SecDevOps teams to discuss and debate policies, procedures and processes. It allows for continuous interaction with the business units' security experts, and allows them to raise possible false positives.

This community was kept informed on our development process before the solution was launched, including early testing with the consumer container. Before the solution went live with the full policy control set, we conducted ~5 weeks of extensive validation testing with the Office Hours community. We shared scan results with the teams and actively solicited concerns and false positive reports. This proved crucial both in understanding the specific ways different developer teams had deployed their landscapes and in gaining the confidence of the community in the scans.

The speed and agility in fixing confirmed false positives, if anything, brought on more engagement. An early report of a control that generated many false positives in the landscape was investigated and a bug fix created within ~24 hours (really about 20). When the community heard that, it encouraged business units to work with us and make the control set much more robust at launch. Moreover, it raised the confidence of the internal teams in the quality of the alerts, and showed that if they found more issues, we were committed to investigating them, responding quickly, and if necessary fixing the control code... or even the policy.

Together with the interaction between the SecDevOps and SGS Defensive Architecture policy team, this constant interaction with the internal security community makes the public cloud infrastructure policy one of the most tested and validated, frequently updated and practical security policies in the company.

Increasing Contextual Awareness and Sophistication

This continuous community interaction has led to an increasing contextual awareness and sophistication in the policies and resulting controls. As teams investigate findings, they bring us contextual information about their application, how it is deployed, etc. The investigation of possible false positives, or of attempts by business units to fix problems that don't get picked up by the control, can lead to policy adjustments or added conditions or checks in the controls.

This can go into depths that are difficult to remember in full without looking at the control code itself! One of the simpler examples covers exposure of blocklisted ports to the public internet, where the control on Azure now checks whether a default security policy is in place, whether a custom security policy is in place and whether the configuration allows ingress on specifically named protocol ports.

Since the controls are ultimately just Ruby code, we can always add in additional conditions as requirements and policy judgements dictate.
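To give a flavour of the blocklisted port example, here is a deliberately simplified sketch, in Python rather than the Ruby the controls are written in. The blocklist, the `IngressRule` fields and the set of "public" sources are assumptions for illustration. The sketch ignores rule priority and does not model the default versus custom security policy checks the real Azure control performs.

```python
from dataclasses import dataclass

# Hypothetical blocklist; the real list is defined by the security policy.
BLOCKLISTED_PORTS = frozenset({22, 3389})
# Source values treated as "the public internet" in this sketch.
PUBLIC_SOURCES = frozenset({"*", "0.0.0.0/0", "Internet"})


@dataclass(frozen=True)
class IngressRule:
    source: str     # source address prefix of the rule
    port_low: int   # first port in the rule's range
    port_high: int  # last port in the rule's range (inclusive)
    allow: bool     # True for an allow rule, False for a deny rule


def port_alerts(resource_id: str, rules: list[IngressRule]) -> list[dict]:
    """Return one failed result per blocklisted port open to the public internet."""
    exposed = set()
    for rule in rules:
        if not rule.allow or rule.source not in PUBLIC_SOURCES:
            continue
        exposed.update(
            port for port in BLOCKLISTED_PORTS
            if rule.port_low <= port <= rule.port_high
        )
    return [
        {"resource_id": resource_id, "port": port, "status": "failed"}
        for port in sorted(exposed)
    ]
```

A rule allowing all ports from `*` yields two results here, one per blocklisted port, rather than a single result for the resource.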

Adjust to Cloud Provider API and Admin Console UI changes

The public cloud provider landscape is not at all static: APIs, services and documentation change all the time, and the changes aren't necessarily well explained. There are often "classic" and "new/current" versions of the same service or function and this can cause significant confusion. The interaction with the community, collaboration with the team implementing preventive controls and continuous partnership with the SGS Defensive Architecture team ensure we stay on top of such changes and adjust controls and policies as needed.

Since the resource packs for Chef InSpec are open source, we don't have to wait for a vendor to react - we can put the investment into these resource packs ourselves to provide support for new cloud APIs or other changes in the platform. This is of prime business benefit to us, but of course also helps others in the Chef InSpec community.

Multiple Data Access Channels

SAP is a large and complex organization with 100,000+ employees, and tens of thousands of those are involved in or affected in some way by their teams operating public cloud landscapes. This runs from developer, DevOps and operations teams with their hands on the knobs, so to speak, to managers and business leaders, governance and compliance teams, and senior leadership.

In order to cater to this variety of roles and personas, we developed multiple data access channels to meet the different needs and responsibilities of the organization. I already mentioned the consumer container above, which is developer/DevOps focused, and the first diagram shows the data ingestion into SAP's central SIEM. Every Monday we also send out a Multicloud Security Compliance report containing current open alerts to all account owners (named individuals in the company accountable for everything that happens in a particular cloud account) and security officers (named individuals in the company responsible for following up on security issues in a particular cloud account).

The mailer is deliberately sent out a few days before the Thursday data exports, which are organized by the company board area and business unit hierarchy, and are used by central security and operational teams to drive compliance processes within their organizations. These data exports also feed the central status reporting that goes to executive leadership, and are discussed weekly on the Office Hours calls. Interactive dashboards provide further insight into the relative compliance posture across board areas, business units and teams, environment types and cloud provider.

Finally, access is provided via API to allow teams (especially larger business units) to automate the scan results into tickets for their organization, distribute them to the right internal teams, or create their own internal dashboards to track progress.
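What a business unit does with the API data is up to them. As a sketch of the kind of automation meant here, the function below groups failed results by owning team so each group can become one ticket. The field names (`status`, `team`, `control_id`, `resource_id`) are assumptions for this example and not the actual API schema.

```python
from collections import defaultdict


def tickets_by_team(results: list[dict]) -> dict[str, list[str]]:
    """Group failed scan results into one work list per team."""
    grouped = defaultdict(list)
    for result in results:
        if result["status"] != "failed":
            continue  # passed and skipped results need no follow-up
        grouped[result["team"]].append(
            f'{result["control_id"]} on {result["resource_id"]}'
        )
    # Sort teams and items so repeated runs produce stable ticket content.
    return {team: sorted(items) for team, items in sorted(grouped.items())}
```

Each key in the returned dictionary could then be handed to whatever ticketing or dashboard system the business unit already uses.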

Self-Adjustment to Organizational Change

It is a fact of life that large corporations undergo continuous change, and SAP is no different. For a variety of reasons, business units and teams get re-organized on a regular basis, and we typically see a few of those changes each year. We take this into account during the scan data enrichment process: the cost center associated with each account is checked daily against a central system of record for its organizational hierarchy.

As the cloud account information updates, so does the organizational hierarchy associated with each compliance scan for each account. Without having to do anything at all - no manual regrouping of accounts, no operational effort of any kind - the data exports, status reporting and dashboards self-adjust to the new organizational reality.
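The enrichment step can be pictured as a lookup join. The sketch below is a simplification with hypothetical field names and in-memory dictionaries standing in for the account inventory and the central system of record.

```python
UNKNOWN = {"board_area": "unknown", "business_unit": "unknown"}


def enrich(results: list[dict],
           account_cost_center: dict[str, str],
           cost_center_hierarchy: dict[str, dict]) -> list[dict]:
    """Attach the current organizational hierarchy to each scan result."""
    enriched = []
    for result in results:
        cost_center = account_cost_center.get(result["account_id"])
        # Look the hierarchy up at enrichment time, never store it on the account.
        hierarchy = cost_center_hierarchy.get(cost_center, UNKNOWN)
        enriched.append({**result, "cost_center": cost_center, **hierarchy})
    return enriched
```

Because the hierarchy is resolved on every run, a re-organization only changes the contents of `cost_center_hierarchy`. The same scan results then land under the new board area and business unit without any manual regrouping.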

Exceptions

Inevitably, in a company as large and complex as SAP, with a pretty strict policy regime that applies to everyone, there are going to be exceptions. SAP has a clear security exception process in place that ensures any exception request is appropriately documented and indicates mitigating security controls. Each request must be signed off by the BISO of the business unit and, for any high severity policy item, must be judged by SAP Global Security.

Each granted security exception has an associated ticket and is logged in our Exception database. Once it is, the scans will "skip" resources that are subject to the exception - and therefore the exception is applied at the time of the scan itself, i.e. at the source. As a result, all data access methods derived from the scan data already have the exception applied.

This also ensures that exceptions are properly recorded. Unless an exception is properly recorded, it doesn't end up in the exception database, and therefore, as far as the compliance scans are concerned, it doesn't exist.
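Applying exceptions at the source can be sketched as follows. The dictionary keyed by control and resource stands in for the Exception database, and all names here are hypothetical.

```python
from typing import Callable


def scan_resource(control_id: str,
                  resource_id: str,
                  check: Callable[[], bool],
                  exceptions: dict[tuple[str, str], str]) -> dict:
    """Evaluate one control on one resource, honouring granted exceptions."""
    ticket = exceptions.get((control_id, resource_id))
    if ticket is not None:
        # A recorded exception exists: skip without running the check at all.
        return {"control_id": control_id, "resource_id": resource_id,
                "status": "skipped", "exception_ticket": ticket}
    status = "passed" if check() else "failed"
    return {"control_id": control_id, "resource_id": resource_id,
            "status": status, "exception_ticket": None}
```

Everything downstream (exports, reports, dashboards, API) reads the `status` field, so none of those channels needs its own exception logic. An exception that was never recorded simply has no entry in the lookup, and the resource is scanned like any other.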

"Eh, but...": Caveats

As the Head of Multicloud Security Operations, I am very proud of my team of SecDevOps Engineers as well as the SGS Defensive Architecture team for making this all real. I am also thankful for the support of our CSO and CISO, as well as that of the business leaders in the company to ensure we don't just have a technical solution but that it ties in tightly with the internal security and compliance processes to ensure action is taken and we stay on top of any alerts.

I would be remiss, though, if I didn't mention some caveats:

Being able to do this takes a team. The SecDevOps team has fifteen team members, in large part to support this security compliance function. Not every organization has the scale of SAP to afford that

Without close partnership between the SecDevOps and security policy team, this would not have worked as well as it did. The continuous dialogue between teams has to be a two-way street to ensure the controls meet the policy intent and the policy adjusts to practical considerations and remains achievable

Engagement with the user community of security experts in the business units is critical, as well as support within the organization, especially because...

The number of alerts in the landscape has risen dramatically, due to a combination of deliberate policy choices (for instance, each open blocklisted port is now its own alert, whereas previously one or more open ports would create just one alert for the resource), controls that didn't exist in the previous toolset, and refinement and sharpening of policies as we continuously drive for improvement

This creates more work and stress on organizations and teams that are already busy with many other things

You therefore need internal support structures such as Office Hours, enablement tools and processes that make compliance easier to achieve, and the development of central services or preventive controls that solve challenges at a company-wide level

Tools don't fix security, people do. And this is not a set-and-forget SaaS service. It requires commitment from the organization as a whole, and a team of cloud security experts dedicated to digging into the cloud APIs, the documentation and any false positive reports, adjusting controls as required, and sharing their knowledge with the internal community. It takes an engaged security organization constantly willing to justify their judgements and face the community and the consequences of their decisions.

You need the executive support, but to actually make things happen, you need security champions running security improvements within their business units to cascade any alerts down. Many have grown into our most impactful supporters, and we owe it to them especially to do whatever we can to make the controls as high-quality as possible.

Therefore, this is not for the faint of heart and requires a serious commitment.

Final Thoughts

I deliberately wanted to highlight operational benefits, so certain additional ones didn't get mentioned. The solution is cost-effective and since we operate it ourselves, we have a variety of ways to manage operational run cost. The ability to contribute ourselves to the Chef InSpec open source process has operational benefits, but of course these contributions themselves cascade down to the InSpec user community, and we are delighted that we thereby support security compliance efforts beyond SAP itself.

Looking ahead, towards a technology future that is increasingly cloud-native, open source software will undoubtedly play a big role in cybersecurity operations. The CNCF security & compliance landscape is an interesting mix of commercial and open source options. From a purely personal perspective, I think it is increasingly prudent to seriously include open source options for security functions and evaluate them alongside commercial options, if you aren't already doing so.