1 + 1 = less than you expected
On Monday (January 12th), some of you saw the platform behaving in a strange way: points were not given for actions during a certain time frame, and the overall performance of the platform was slow. This is the first time we have seen something like this on a large scale on the new platform, and it is a serious incident. Any form of data loss or disruption of data integrity is unacceptable, which is why I want to share with you some background on why this happened and what actions we have taken.
Issues like this usually don't arise from regular operations, and in this case two unusual situations came together, which led to the incident.
SAP Cloud ID Integration
SCN uses SAP Cloud ID (formerly SAP ID Service) for SSO and user profile information. SAP Cloud ID is the master for all profile information and any change that you perform in the SCN user profile gets synced back to it.
Last week SAP Cloud ID moved to a new data center infrastructure, with zero downtime or disruption for SAP Cloud ID itself. Due to this move, some specific rules on our hardware firewall kicked in and prevented any API call from SCN into SAP Cloud ID. One result was that profile changes could no longer be synced back from SCN correctly, which is unfortunate but, from my perspective, not a big deal. Each SCN profile gets synced with SAP Cloud ID at login, so the worst case is that some changes got overwritten at the next login.
Another result of these API calls getting blocked was that the platform-internal event queue on each app server got locked while waiting for the API call, which led to the event queues filling up. You noticed this by seeing the read state on the Communications page not updating in time, which got noted in the support forums (Notifications icon for “Communications” stream not updating (again)). While we were working on identifying the root cause of this issue, another incident happened.
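To illustrate the mechanism, here is a minimal sketch, not SCN's actual code, of how a single consumer thread stuck on a synchronous call lets a bounded in-memory event queue fill up. All class and method names, and the queue size, are made up for illustration:

```java
import java.util.concurrent.*;

public class BlockedQueueDemo {

    public static void main(String[] args) throws Exception {
        // Bounded in-memory queue, like an app server holding pending events.
        BlockingQueue<String> events = new LinkedBlockingQueue<>(100);

        // Single consumer: takes events one at a time and calls the identity
        // service synchronously for each of them.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String event = events.take();
                    syncProfileToIdentityService(event); // hangs once firewalled
                }
            } catch (InterruptedException e) {
                // shutdown requested
            }
        });
        consumer.start();

        // Producer: user actions keep arriving while the consumer is stuck,
        // so the queue fills to capacity and further events are rejected.
        for (int i = 0; i < 120; i++) {
            if (!events.offer("event-" + i)) {
                System.out.println("queue full, dropped event-" + i);
            }
        }
        System.out.println("events stuck in the queue: " + events.size());
        consumer.interrupt();
    }

    // Simulates an API call whose packets are silently dropped by a firewall:
    // the caller just waits forever for a response that never comes.
    static void syncProfileToIdentityService(String event) throws InterruptedException {
        Thread.sleep(Long.MAX_VALUE);
    }
}
```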
DDoS from Tor Servers
For the first time on the new platform, we saw a planned DDoS attack on our infrastructure. Someone was using the Tor anonymity network to send a high volume of requests for forum posts, with the intention (I have to assume) of bringing down our platform. The attack lasted from Sunday the 11th until Monday the 12th, when we were able to block the requests. During the attack we saw an increase in incoming requests of up to 30 times our normal load, and as a result our app server thread pools filled up to the point that no more threads were available on most app servers, which meant some users got timeouts when accessing SCN. Tor provides a full list of the IP addresses of all exit nodes, which helped us block those requests from hitting the app servers.
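As background on how such a block can work: the Tor Project publishes the addresses of its exit nodes, so a service can match incoming IPs against that list. Below is a rough sketch of the idea, not our actual firewall configuration; the URL is the Tor Project's current public bulk exit list endpoint, and in a real setup the list would be fed into the firewall rather than checked in application code:

```java
import java.io.*;
import java.net.*;
import java.util.*;

public class TorExitBlocklist {

    public static void main(String[] args) throws IOException {
        Set<String> exitNodes = fetchExitNodes();
        String clientIp = args.length > 0 ? args[0] : "203.0.113.42";
        if (exitNodes.contains(clientIp)) {
            System.out.println(clientIp + " is a Tor exit node -> block");
        } else {
            System.out.println(clientIp + " is not on the exit list -> allow");
        }
    }

    // Downloads the published list of Tor exit node IP addresses.
    static Set<String> fetchExitNodes() throws IOException {
        Set<String> ips = new HashSet<>();
        URL url = new URL("https://check.torproject.org/torbulkexitlist");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty() && !line.startsWith("#")) {
                    ips.add(line);
                }
            }
        }
        return ips;
    }
}
```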
While keeping the service online under the load from the DDoS attack, our operations team also had to restart a couple of the app servers. What we didn't realize at this point was that the in-memory event queues were filling up due to the blocking API call into SAP Cloud ID, and that an app server restart would flush these into /dev/null. As a result, already executed (trans)actions from that time frame, like creating a blog post or document, neither show up in the personal activities nor got transferred to the Gamification backend for reputation calculation.
Around 4pm CET on Monday, all access from Tor got blocked, which stabilized the system and brought it back to expected performance levels. Around noon on Tuesday, the firewall adjustments became active, and from there on the filled-up queues on several app servers needed a couple of hours to get processed, which is why some activities and point assignments took very long to show up.
Aftermath and Actions
For a certain path in the processing chain, actions are only stored in memory, and due to the server restarts these events were unfortunately lost. Any data loss or impact on data integrity at a larger scale is unacceptable, and I'm sorry that this happened. We have taken actions to ensure this won't happen again.
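One common safeguard against exactly this failure mode, sketched below purely as an illustration and not as a description of the fix we implemented, is to journal each event to durable storage before enqueueing it, so that a restart replays pending events instead of discarding them. All names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.*;

public class DurableEventQueue {

    private final Path journal;
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    DurableEventQueue(Path journal) throws IOException {
        this.journal = journal;
        // On startup, replay events that were journaled but never processed.
        if (Files.exists(journal)) {
            for (String line : Files.readAllLines(journal)) {
                queue.add(line);
            }
        }
    }

    void publish(String event) throws IOException {
        // Append to the journal first: if the server dies now, the event survives.
        Files.writeString(journal, event + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        queue.add(event);
        // A real implementation would also remove events from the journal
        // once the downstream consumer has acknowledged them.
    }

    public static void main(String[] args) throws IOException {
        Path journal = Files.createTempFile("events", ".log");
        DurableEventQueue q = new DurableEventQueue(journal);
        q.publish("blog-post-created");
        q.publish("document-created");

        // Simulate a restart: a fresh instance rebuilt from the same journal
        // still holds both events instead of flushing them to /dev/null.
        DurableEventQueue afterRestart = new DurableEventQueue(journal);
        System.out.println("events after restart: " + afterRestart.queue.size());
    }
}
```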
The fact that a blocking API call has any effect on the event queue at all is being analyzed and needs to be fixed, so that we won't see this issue anymore. It is unclear whether this is due to our custom integration with SAP Cloud ID or a shortcoming of the underlying platform, but for now I have to blame our customization.
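Independent of where the root cause lies, a typical mitigation is to bound how long the event consumer may wait on the remote call. A minimal sketch, assuming a 5-second budget is acceptable and with all names being illustrative:

```java
import java.util.concurrent.*;

public class TimeoutGuard {

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Run the remote call on a separate thread so we can bound the wait.
        Future<Void> call = pool.submit(() -> {
            syncProfileToIdentityService(); // pretend this hangs on a dropped connection
            return null;
        });

        try {
            call.get(5, TimeUnit.SECONDS); // fail fast instead of blocking the queue
        } catch (TimeoutException e) {
            call.cancel(true); // give up; a real system would retry or dead-letter the event
            System.out.println("sync timed out -> the event queue keeps draining");
        } finally {
            pool.shutdownNow();
        }
    }

    // Simulated hanging call to the identity service.
    static void syncProfileToIdentityService() throws InterruptedException {
        Thread.sleep(Long.MAX_VALUE);
    }
}
```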
We also consider the Tor network a valuable service and don't like to see it blocked long term. We checked Tor traffic from before the attack and saw some regular usage. In order to enable Tor access again, we are looking into implementing a request blocker that only gets activated when a certain threshold is hit, so that regular Tor users will still be able to use SCN.
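To make the idea concrete, here is a toy sketch of such a threshold-activated gate, a fixed-window rate limiter over all Tor-originated requests; the window size and limit are invented numbers, not our actual configuration:

```java
public class ThresholdBlocker {

    private final int maxRequestsPerWindow;
    private final long windowMillis;
    private int count = 0;
    private long windowStart = System.currentTimeMillis();

    ThresholdBlocker(int maxRequestsPerWindow, long windowMillis) {
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Returns true if this request should be allowed through. */
    synchronized boolean allow() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {
            windowStart = now; // new window: start counting from zero again
            count = 0;
        }
        return ++count <= maxRequestsPerWindow;
    }

    public static void main(String[] args) {
        // Toy setup: allow at most 100 Tor-originated requests per second.
        ThresholdBlocker torGate = new ThresholdBlocker(100, 1_000);
        int blocked = 0;
        for (int i = 0; i < 3_000; i++) { // simulated burst, DDoS-style
            if (!torGate.allow()) {
                blocked++;
            }
        }
        System.out.println("blocked " + blocked + " of 3000 burst requests");
    }
}
```

Regular Tor users would stay well under such a limit, so the gate would only bite during an attack-scale burst.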
Please also have a look at the post by Audrey Stevenson on this topic: Recent Issue with SCN Not Recording Point Gains and Losses
That's what I would call a prime example of transparent communication! Yeap, sometimes things hit the fan and get ugly... #shithappens - however, what's more important is how you react and what you do to ensure it won't happen a second time!!!
Kudos to Oliver & team!
Hi Oliver,
I second what Matthias has said - stuff happens, and dealing with it, and reporting to your stakeholders in a clear manner are the key actions that we all know need to be taken.
So, as a stakeholder, I thank you.
Sue
Thanks! Happy to have you as a stakeholder.
Wait, am I a stakeholder or just a pesky user? 😆
The created content wasn't lost and you guys are working on a fix, so that won't happen again. What more can I want?
Thank you for the transparency now and the fast reactions during this issue. That's a great way of handling this. 🙂
Beautiful explanation. We can only wish that everyone was as clear and transparent.
Many thanks to you Oliver and the whole team!
- Ludek
The irony in this is that I probably reach Topaz level because of this blog post...
Good thing, the points counting stuff is working again then. 😀
Hi Oliver
mmm... how many system issues to make Emerald?
Topaz is deserved for transparency. Mistakes happen (every one of us on SAP has probably contributed to some form of incident for someone... sad to say) but without root cause analysis and taking ownership of them there is little chance of avoiding them again in future. Glad to hear it'll get fixed.
Pity you had multiple issues contributing to this, but major incidents like these usually have more than one contributing point of failure.
Regards
Colleen
Thanks Colleen! I hope to have some better stuff to talk about pretty soon.
You did. 😎
Hi Oliver and team!
Thanks for the transparency and concern in communicating to everyone the problems that occurred.
Thank you!
Karen Rodrigues
Is this related to the problem which started on January 7th or perhaps 6th?
Thanks, Mike
SAP P&I Technology RIG
Yep, same problem, Mike.
Now I get to be a good poster and award points!
Thanks Oliver!
Unauthorized
You are not a moderator, hence you cannot follow this link as it goes to the moderator space.
Hmm, I thought this would be about the binary system... 🙂
Thank you for sharing, Oliver, your blogs are always a pleasure to read. Combining open communication with a learning experience - it doesn't get any better than this!
Is it happening again right now?
Yes. Since yesterday night.
You danged ol' script kiddies! Get off my lawn!!!! *shakes cane vigorously* 😛
And it is still going on right now.
Cheers, Mike
SAP Technology RIG