#FR13 Post Mortem
I really wanted to write about a different part of current SCN, something a little bit more on the light side. But this has to wait for another day.
On Friday we had the longest unplanned SCN downtime, at least as far as I can recall. We were down for 9 hours, which is of course unacceptable and should never happen.
The root cause was an outage of a central network storage system, which provides file storage for the cache server. The cache server's purpose is to serve cached content to the applications on each app server, and thereby take load off the database. When the cache server failed on Friday around 3pm CET, all requests from the app servers had to be served by the database directly, which is simply not possible with our current traffic. DB requests stalled, application threads were blocking, and as a result none of the application servers could respond to HTTP requests anymore. That’s when you see the “Sorry Page”.
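To make that failure mode a bit more concrete, here is a minimal sketch of one common way to keep a cache outage from blocking every application thread: only a small, fixed number of requests may fall back to the database at once, and the rest fail fast. This is not how SCN is implemented. The names `GuardedReader`, `CacheUnavailable`, `Overloaded`, `cache_get`, `db_get` and `max_db_fallbacks` are all hypothetical and exist only for illustration.

```python
import threading


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the cache server is down."""


class Overloaded(Exception):
    """Raised when the database fallback is already at its concurrency limit."""


class GuardedReader:
    """Read through the cache; on cache failure, allow only a bounded number
    of concurrent database reads instead of sending every request to the DB."""

    def __init__(self, cache_get, db_get, max_db_fallbacks=2):
        # cache_get and db_get are assumed callables: key -> value.
        self._cache_get = cache_get
        self._db_get = db_get
        self._db_slots = threading.BoundedSemaphore(max_db_fallbacks)

    def read(self, key):
        try:
            return self._cache_get(key)
        except CacheUnavailable:
            pass  # cache is down, try the bounded database fallback

        # Non-blocking acquire: if no slot is free, fail fast rather than
        # letting this thread queue up behind a stalled database.
        if not self._db_slots.acquire(blocking=False):
            raise Overloaded(key)
        try:
            return self._db_get(key)
        finally:
            self._db_slots.release()
```

A caller that catches `Overloaded` can return an error page for that one request right away, so the thread is freed instead of waiting on the database.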
We have only one cache server, and that was an intentional decision. We tried two- and three-server cache clusters years ago, but the current version of our collaboration platform doesn’t support this very well and we never fixed it. We trusted the infrastructure, but should have been better prepared. On top of that, communication between the teams involved was slow and ineffective, and key people were simply not available to get on a machine. 9 hours is a long time and there are many gory details, some I don’t want to talk about and of course many I just don’t know enough about yet.
Looking at the chain of effects, it becomes clear that team communication needs to improve. Communication protocols between the teams need to be better documented, especially escalation points. That, together with clear communication channels to key contacts for the infrastructure services we depend on, will improve escalation handling for this type of outage.
But the key here, of course, is the single points of failure in the architecture of current SCN. All functionality is served by a single application. And if that application is affected, all of SCN becomes inaccessible.
This is one of the reasons why we chose to build the new SAP Community in a different style, with multiple core collaboration applications and supporting microservices. Even if a core application fails, other parts of the Community will still be available. And each single application and service is designed for high resilience. We have zero downtime deployments available in most areas, and have also made sure that application servers are better distributed across multiple data centers.
Such a distributed architecture, on the other hand, introduces new complexities. We have already done quite a bit of monitoring, and have gotten in contact with key people in the infrastructure and operations teams of the core services that we use, like SAP HANA Cloud Platform.
I hope that current SCN will serve us well next week, when SAPPHIRE NOW is going to happen. It will be the last SAPPHIRE NOW with this incarnation of SCN that we are all going to use. At SAPPHIRE NOW we will show you the new SAP Community and give everybody full access to it, even if it is still in Beta state. There are still plenty of bugs, inconsistencies and missing integrations, but we think it is worth showing it to you. I for sure do!
Apologies for the mistakes that I and we as a team make. Sometimes this happens when you push boundaries, and these won’t be the last ones. But otherwise it wouldn’t be so much fun.
Hi Oliver, thank you very much for the transparency here. Our partnership with you and your team has been excellent throughout the 1DX project, and this downtime, while unfortunate, isn't the important story: it's the new SCN, and I am also very excited about this - better, faster, stronger!
No worries. Some of us addicts need to be weaned off SCN and go out into the sun on occasion. 😛 I for one appreciate all the hard work you and the whole team do.......just don't let it ever happen again!!!! *wags finger vigorously* HAHAHA
Hi Oliver,
Lots of questions came up from the user community. So thanks for the analysis of the Friday 13th bomb! Communications is often a problem when confronted with something new and unknown. Who to bring in and when can't always be anticipated.
Cheers, Mike
SAP Technology RIG
Thanks for the post, Oliver! We can all learn from such "fail shares".
It was super-annoying that downtime happened within prime office hours in the US (and I needed some step-by-step guides from SCN! 🙂 ), but I'm sure the teams in other time zones were even less happy with spending Friday night fixing this. My heart goes to you and your families, thanks for all your work!
Also, as Christopher said, we all need to get out sometimes, so that's exactly what I did. Maybe SCN should crash more often, eh? 😉
"working as expected"
Thanks a lot, Oliver. Ironically, when I wanted to read the post yesterday evening, guess what - SCN was down again. But then, I believe @PHDcomics got it just right: The Internet is down https://t.co/dVkBkHY4Z8 https://t.co/a13DMipkPK
Excellent blog, thanks for sharing some insight and really can't wait to get access to the new SCN now!
Great blog post Oliver, love that you keep it honest and tell it how it is - speaks volumes about your work attitude!
I'm excited that - starting today - the whole community can experience what your team has been developing for the last couple of months: SCN Open Beta - Open for Business
It's great to know that the beta is available now: SCN Open Beta
Good work guys. I really like the beta version of SCN and hope it keeps getting better and better day by day.
Cheers and congratulations to the team.
-Yogesh
Hi Oliver
One of the best Root Cause Analysis responses to the customer (our community) that I have read. It's a collective who fails and never just one individual. I witness reservation in true honesty for RCA as people worry they will become a scapegoat rather than treat it as an opportunity to improve and prevent re-occurrence.
We can't learn or improve without truly admitting where fault lies. It's rarely one item alone that makes something happen. Communication is always a key and can always be improved.
Thanks to you and your team for the work in getting the system back. I look forward to the new platform with less downtime 🙂
Regards
Colleen
Oliver,
At first I thought it was a planned downtime, something related to the new SCN beta and aligned to SAPPHIRENOW. But later I noticed that it was something unplanned.
Thanks for sharing what happened. It brings light and transparency, as well as information for us - developers/tech nerds - who are also eager to understand why our favorite service went down.
Regards,
JN