#FR13 Post Mortem
I really wanted to write about a different part of the current SCN, something a little more on the lighter side. But that has to wait for another day.
On Friday we had the longest unplanned SCN downtime, at least as far as I can recall. We were down for 9 hours, which is of course unacceptable and should never happen.
The root cause was an outage of a central network storage system, which provides file storage for the cache server. The cache server's purpose is to serve cached content to the applications on each app server and thereby take load off the database. When the cache server failed on Friday around 3pm CET, all requests from the app servers had to be served by the database directly, which is simply not possible with our current traffic. DB requests stalled, application threads blocked, and as a result none of the application servers could respond to HTTP requests anymore. That’s when you see the “Sorry Page”.
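To make the failure mode above more concrete, here is a minimal sketch in Python of a read path that falls back to the database when the cache is down, but only lets a limited number of threads through at a time. This is not how SCN is implemented. The names `BoundedFallbackReader`, `CacheUnavailable`, `Overloaded`, `cache_get`, `db_get` and `max_db_concurrency` are all hypothetical and exist only for illustration. The idea is that a request which cannot get a database slot fails fast instead of blocking a thread.

```python
import threading


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the cache server is down."""


class Overloaded(Exception):
    """Raised when the database fallback has no free slot for this request."""


class BoundedFallbackReader:
    """Read from the cache first; fall back to the database with a concurrency cap."""

    def __init__(self, cache_get, db_get, max_db_concurrency):
        # cache_get and db_get are hypothetical callables taking a key.
        self._cache_get = cache_get
        self._db_get = db_get
        self._db_slots = threading.BoundedSemaphore(max_db_concurrency)

    def read(self, key):
        try:
            return self._cache_get(key)
        except CacheUnavailable:
            pass  # Cache is down, try the database below.

        # Do not wait for a slot: a waiting thread is exactly what
        # piles up and stalls an app server during a cache outage.
        if not self._db_slots.acquire(blocking=False):
            raise Overloaded(key)
        try:
            return self._db_get(key)
        finally:
            self._db_slots.release()
```

A caller would catch `Overloaded` and answer with an error or a stale response right away. Whether that trade-off is acceptable depends on the application, so treat this as a starting point for discussion rather than a recommendation.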
We have only one cache server, and that was an intentional decision. We tried two- and three-server cache clusters years ago, but the current version of our collaboration platform doesn’t support this very well and we never fixed it. We trusted the infrastructure, but should have been better prepared. On top of that, communication between the teams involved was slow and ineffective, and key people were simply not available to get on a machine. 9 hours is a long time and there are many gory details, some I don’t want to talk about and of course many I just don’t know enough about yet.
Looking at the chain of effects, it becomes clear that team communication needs to improve. Communication protocols between the teams need to be better documented, especially escalation points. That, together with clear communication channels to key contacts for the infrastructure services we depend on, will improve escalation handling for this type of outage.
But the real issue here, of course, is that the architecture of the current SCN has single points of failure. All functionality is served by a single application, and if that application is affected, all of SCN becomes inaccessible.
This is one of the reasons why we chose to build the new SAP Community in a different style, with multiple core collaboration applications and supporting microservices. Even if a core application fails, other parts of the Community will still be available. And every application and service is designed for high resilience. We have zero-downtime deployments available in most areas, and we also made sure that application servers are better distributed across multiple data centers.
Such a distributed architecture, on the other hand, introduces new complexities. We have already put quite a bit of monitoring in place, and have gotten in contact with key people in the infrastructure and operations teams of the core services that we use, like SAP HANA Cloud Platform.
I hope that the current SCN will serve us well next week, when SAPPHIRE NOW takes place. It will be the last SAPPHIRE NOW at which we all use this incarnation of SCN. At SAPPHIRE NOW we will show you the new SAP Community and give everybody full access to it, even though it is still in beta. There are still plenty of bugs, inconsistencies and missing integrations, but we think it is worth showing to you. I for sure do!
Apologies for the mistakes that I, and we as a team, make. Mistakes happen when you push boundaries, and these won’t be the last ones. But otherwise it wouldn’t be so much fun.