Security Chaos Engineering and Security Engineering Amid Chaos: Cloud-native Cyber Resilience
For those new to cloud security it is easy to find well-intentioned recommendations on what you should do, but much harder to find clear frameworks and standards on how to implement and operate effective controls and processes in cloud landscapes. Traditional IT operations and cybersecurity frameworks are still rarely cloud-aware, while direct experience is exchanged between peers, through conferences, social media, podcasts and blog posts.
That informal sharing of cloud-native security practices is great, but it is also not very accessible. Regular cloud breach reports and ‘state of cloud security’ analyses consistently show that organizations are generally failing to secure cloud landscapes at the basics. Talking to peers securing their cloud landscapes, we find we have often arrived at similar cloud-native practices, but we all did it the hard way: largely in isolation from each other, reacting to circumstance, finding solutions to challenges and building upon successes. Nobody wants to discuss their challenges until they have overcome them.
We argue with colleagues and auditors who question our approach, talk about the cloud-native mindset and the benefits of infrastructure- and policy-as-code, explain how distributed, immutable and ephemeral (DIE) approaches support better security outcomes with less effort, and deal with the organizational challenges that cloud transformation brings. All of that takes a lot of energy that could be better spent. What we need is a framework or instruction book that gets us on the same page.
Security Chaos Engineering
Thankfully, Kelly Shortridge, with Aaron Rinehart, has provided exactly that with Security Chaos Engineering: Sustaining Resilience in Software and Systems (O’Reilly, 2023). In this book, Shortridge has managed to tie software engineering and DevOps, chaos engineering out of Site Reliability Engineering (SRE), and cloud-native technologies and practices together into a complete development and operations lifecycle guidebook for cloud-native cyber resilience. For experienced practitioners it places practices you may already have adopted in context, within a coherent framework. For less cloudy developers, DevOps engineers, SREs and security professionals, it is a handbook to build processes and quality, resiliency and security programs around, and it helps explain what their cloud-native colleagues are always talking about. Shortridge and Rinehart did us all a great favor, whether we are far along in our cloud maturity journey or just at the beginning. For the former it will bring confirmation that what they were already doing was the right thing to do. For the latter it will save a lot of pain and wasted effort.
As previously with Autonomic Security Operations: 10x Transformation of the Security Operations Center (Google, 2021), I read the book with constant flashes of recognition. While SAP does not generally perform chaos engineering experiments in an organized way (yet), let alone security chaos experiments – a key aspect of Shortridge’s cyber resiliency framework – virtually everything else lines up with our own cloud security programs and practices, support structures and strategy, built up over the past four years.
A special twist came in Chapter 9, with David Lavezzo’s Experience Report at Capital One, as the media storm around an unusual cloud security breach there in 2019 caught the attention of SAP customers, SAP’s CEO and Executive Board, CSO and CISO, and thereby provided the space and support for our own cloud security improvement program. Appropriately, this article is more an additional case study to Chapter 9 than a review of the book.
Security Engineering Amid Chaos
In a discussion, Kennedy Torkura, a key thought leader in Security Chaos Engineering (SCE), was surprised that SAP took the same approach as SCE, but without security chaos experiments. I explained that we largely stumbled into what we call SecDevOps alongside DevSecOps through circumstance. It struck me then that what we did was not security chaos engineering, but security engineering amid chaos.
SAP started experimenting with public cloud around 2015. By 2019 this had grown to over 2,000 cloud accounts. A Cloud Security Posture Management (CSPM) tool had been deployed but was simply collecting alerts, and their number grew with the size of the landscape. Public cloud use was growing quadratically year over year, and the organization was not set up to follow up on alerts, as became instantly apparent when an emergency remediation program was launched in August 2019.
Developer teams had access to the CSPM tool, but there was no enforcement. Distribution of alerts immediately ran up against asset management and data challenges. Teams were driven into the cloud and were suddenly responsible for the secure configuration of network and infrastructure that other teams used to take care of, while getting familiar with an ever-growing number of new cloud services and APIs. We were in the middle of deploying the second CSPM tool. This needed a far bigger program than chasing teams on alerts.
A team of pioneers got together to sketch out a plan. They came from what is now called Multicloud Security Architecture and Engineering – part of the cloud infrastructure services and operations organization and arguably a platform engineering team as described in Security Chaos Engineering – and from SAP Global Security, supported by the Chief Security Officer and Chief Information Security Officer. The plan included organizational preventive controls, detective controls, and developer enablement and support in the form of guides, templates and secure resource orchestration, based on cloud-native and DevOps approaches. To ensure we were effective, we committed the program to a measurable security outcome: an 80% reduction in the number of CSPM-relevant alerts, regardless of the growth in the number of resources deployed. By the end of 2020, alerts had been reduced by 96%.
Any Computer System is Inherently Sociotechnical
As Shortridge says, “any computer system is inherently sociotechnical – humans design, build and operate them”. SAP is an organization of many Lines of Business (LoBs) deploying many computer systems. The COVID pandemic started in early 2020. In October, SAP announced our Next-Generation Cloud Delivery program of accelerated cloud transformation, migrating a new set of solutions and teams into the cloud. By the end of the year it was clear our second CSPM tool did not meet requirements and we committed to developing our own. While the engineering involved in that was significant, amid ever-increasing growth and constant change in cloud services, the greater challenge was the stress that secure cloud transformation placed on the teams developing, migrating, deploying and operating cloud services.
From the start in 2019, we had run a “Hyperscaler Security Office Hours” to support the remediation effort. Over time, together with the dialogues taking place during the rest of the week, this increasingly became a focal point for bringing together the security policy team – or as Shortridge calls it, the ‘blunt end’ – and the LoB teams – the ‘sharp end’ – with the Security Engineering team often playing an important bridging role.
This accelerated during the development and eventual operation of the home-grown CSPM tool. The development of controls to scan the landscape dictated practical approaches, bringing the Security Policy team out of the ‘ivory tower’, so to speak. For a control to work at all, there had to be clear pass-fail conditions that the Security Engineering team – cloud-native specialists focusing on security engineering and operations, following DevOps approaches – could write policy-as-code controls around. This created a feedback loop that made the policy more practical and clearer to the teams that had to comply with it.
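To make the idea of a clear pass-fail condition concrete, here is a minimal sketch of what a policy-as-code control could look like. It is an illustration only and not the home-grown CSPM tool described above: the `Control` and `Finding` types, the `STORAGE-001` control ID, and the resource fields (`id`, `type`, `public_access`) are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    """Result of evaluating one control against one resource."""
    control_id: str
    resource_id: str
    passed: bool
    reason: str


@dataclass(frozen=True)
class Control:
    """A policy statement reduced to an unambiguous pass-fail check."""
    control_id: str
    description: str
    check: Callable[[dict], bool]  # returns True when the resource passes

    def evaluate(self, resource: dict) -> Finding:
        passed = self.check(resource)
        reason = "compliant" if passed else f"violates: {self.description}"
        return Finding(self.control_id, resource["id"], passed, reason)


# Hypothetical control: buckets must explicitly disable public access.
# A missing "public_access" attribute fails the check (fail closed);
# resources that are not buckets are out of scope and pass.
NO_PUBLIC_BUCKETS = Control(
    control_id="STORAGE-001",
    description="storage buckets must not allow public access",
    check=lambda r: r.get("type") != "bucket" or r.get("public_access") is False,
)


def scan(resources: Iterable[dict], controls: Iterable[Control]) -> list[Finding]:
    """Evaluate every control against every resource."""
    controls = list(controls)
    return [c.evaluate(r) for r in resources for c in controls]


def compliance_rate(findings: list[Finding]) -> float:
    """Share of evaluations that passed; an empty scan counts as compliant."""
    if not findings:
        return 1.0
    return sum(f.passed for f in findings) / len(findings)
```

A scan over an inventory is then `scan(resources, [NO_PUBLIC_BUCKETS])`, and each failing `Finding` is something a team can challenge or remediate. The design point is that the policy wording has to be precise enough to be written as the `check` function; where it is not, that is feedback for the policy owner rather than for the engineer.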
The Office Hours specifically extended this dialogue to the LoB teams, providing an open forum to ask questions, challenge scan alerts, controls or policies, discuss potential exceptions, identify possible false positives or candidates for a central service to take care of common issues, and balance security risk with the operational burden to remediate.
The policy owner and the security engineering teams are accountable weekly, directly to their internal customers, while the business unit teams are reminded of their security compliance posture and obligations.
Developer and DevOps teams have their own priorities and pressures. Yet they are the only ones who can develop, deploy and operate secure systems. Even if they are engaged and agree with remediating alerts, they need to be given the space to do so by their team leads and managers. Security and compliance teams can write policies and keep track of failures.
Through central visibility of findings shared in different formats to different audiences, including reporting and dashboards across the organizational hierarchy, board area COO offices could cascade targets and make it clear security is a priority. We created multiple feedback loops that involve the entire business unit in ensuring CSPM alerts and vulnerability findings are followed up on, while allowing dialogue at the right levels when the cost or effort of security controls may not weigh up against the security risks they mitigate, or do not result in effective security outcomes.
The approach has been very successful. Our CSPM compliance with policy runs at around 99%. The deployment of a pioneering Cloud-Native Application Protection Platform (CNAPP) solution allows us to mature in risk-based prioritization of alerts and in deciding where to focus the limited attention available. The scale of the landscape forces teams to adopt increasingly sophisticated CI/CD pipelines to manage it.
The approach has served as a model, and is now part of the security strategy as we continue our secure cloud transformation. Those already familiar with Security Chaos Engineering will recognize many of its themes, despite our lack of security chaos experiments. There is enough chaos in the sociotechnical system as it is.