Oliver Kohl

1 + 1 = less than you expected

On Monday, 12th of January, some of you saw the platform behaving strangely: points were not awarded for actions during a certain time frame, and the platform was slow overall. This is the first time we have seen this on a large scale on the new platform, and it is a serious incident. Any form of data loss or disruption of data integrity is unacceptable, which is why I want to share with you some background on why this happened and what actions we have taken.

Issues like this usually don’t arise from regular operations. In this case, two unusual situations came together, which led to the incident.

SAP Cloud ID Integration

SCN uses SAP Cloud ID (formerly SAP ID Service) for SSO and user profile information. SAP Cloud ID is the master for all profile information and any change that you perform in the SCN user profile gets synced back to it.

Last week SAP Cloud ID moved to a new data center infrastructure, with zero downtime or disruption for SAP Cloud ID itself. Due to this move, some specific rules on our hardware firewall kicked in and blocked every API call from SCN into SAP Cloud ID. One result was that profile changes could no longer be synced back from SCN correctly, which is unfortunate but from my perspective not a big deal. Each SCN profile gets synced with SAP Cloud ID at login, so the worst case is that some changes got overwritten at the next login.

Another result of these API calls getting blocked was that the platform-internal event queue on each app server got locked while waiting for the API call, which led to the event queues filling up. You noticed this when the read state on the Communications page did not update in time, which was noted in the support forums (Notifications icon for “Communications” stream not updating (again)). While we were working on identifying the root cause of this issue, another incident happened.
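To illustrate why a single blocking call can stall a whole queue, and one common way around it, here is a minimal Python sketch. It is not the actual SCN or SAP Cloud ID code: `drain_events`, `firewalled_sync`, the event layout, and the assumption that only profile updates call the remote API are all hypothetical. The idea is that the remote call gets a timeout, and an event that cannot be synced is parked for a later retry instead of holding up everything behind it.

```python
from collections import deque


def drain_events(events, sync_call, timeout_s=2.0):
    """Process every queued event once; return (processed, retry_later).

    `sync_call(event, timeout=...)` is a hypothetical client function that is
    assumed to raise TimeoutError or ConnectionError instead of blocking forever.
    """
    processed = []
    retry_later = deque()
    while events:
        event = events.popleft()
        try:
            sync_call(event, timeout=timeout_s)
            processed.append(event)
        except (TimeoutError, ConnectionError):
            # The remote service is unreachable: park this event and move on,
            # so the events queued behind it are not held up.
            retry_later.append(event)
    return processed, retry_later


def firewalled_sync(event, timeout):
    """Stand-in for the blocked API: profile updates time out, others succeed."""
    if event["type"] == "profile_update":
        raise TimeoutError(f"no answer within {timeout}s")


# Demo: two profile updates are parked, the blog post event still gets through.
events = deque([
    {"id": 1, "type": "profile_update"},
    {"id": 2, "type": "blog_post"},
    {"id": 3, "type": "profile_update"},
])
processed, retry_later = drain_events(events, firewalled_sync, timeout_s=0.1)
```

With a design like this, a firewall rule that silently drops traffic would delay the affected events but would not lock the queue for unrelated ones.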

DDoS from Tor Servers

For the first time on the new platform we saw a planned DDoS attack on our infrastructure. Someone was using the Tor anonymity network to send a large number of requests for forum posts, with the intention (I have to assume) of bringing down our platform. The attack lasted from Sunday the 11th until Monday the 12th, when we were able to block the requests. During the attack we saw incoming requests increase to up to 30 times our normal load. As a result, our app server thread pools filled up to the point that no more threads were available on most app servers, and some users got timeouts when accessing SCN. Tor publishes a full list of the IP addresses of all exit nodes, which helped us block these requests before they hit the app servers.
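As a rough illustration of blocking by exit node address, the following Python sketch checks a client IP against a set of known exit nodes. It assumes the list is available as plain text with one IP address per line, which is an assumption about the format rather than something stated above, and it uses reserved documentation addresses instead of real exit nodes. The function names are hypothetical, and in practice this kind of filter would more likely live in a firewall or load balancer than in application code.

```python
import ipaddress


def load_exit_nodes(lines):
    """Parse one IP address per line, skipping blank lines and '#' comments."""
    nodes = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        nodes.add(ipaddress.ip_address(line))
    return nodes


def is_blocked(client_ip, exit_nodes):
    """Return True if the client address is one of the known exit nodes."""
    return ipaddress.ip_address(client_ip) in exit_nodes


# Demo with documentation-only addresses (not real Tor exit nodes).
sample_list = ["# sample exit list", "192.0.2.10", "", " 2001:db8::1 "]
exit_nodes = load_exit_nodes(sample_list)
```

Parsing the addresses with `ipaddress` rather than comparing strings means that different spellings of the same IPv6 address still match.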

SCN_Load_by_Tor.png

While keeping the service online under the load from the DDoS attack, our operations team also had to restart a couple of the app servers. What we didn’t realize at this point was that the in-memory event queues were filling up due to the blocking API call into SAP Cloud ID, and that an app server restart would flush these into /dev/null. As a result, actions that had already been executed during that time frame, like creating a blog post or document, neither showed up in the personal activities nor got transferred to the Gamification backend for reputation calculation.
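One general way to make an in-memory queue survive a restart is to write each event to a journal file before accepting it and to rebuild the pending set from that file on startup. The sketch below shows the idea in Python. It is only an illustration under simple assumptions (a single process, a local file, JSON-serializable events), and the `JournaledQueue` class is hypothetical, not part of the platform described here.

```python
import json
import os
import tempfile


class JournaledQueue:
    """In-memory queue whose pending events can be rebuilt from a journal file."""

    def __init__(self, path):
        self.path = path
        self.pending = {}  # event id -> event payload
        self._replay()

    def _replay(self):
        """Rebuild the pending events from the journal, if one exists."""
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                if record["op"] == "put":
                    self.pending[record["id"]] = record["event"]
                else:  # "done": the event was fully processed
                    self.pending.pop(record["id"], None)

    def _append(self, record):
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # make sure the record reaches the disk

    def put(self, event_id, event):
        self._append({"op": "put", "id": event_id, "event": event})
        self.pending[event_id] = event

    def mark_done(self, event_id):
        self._append({"op": "done", "id": event_id})
        self.pending.pop(event_id, None)


# Demo: simulate a restart by building a second queue from the same journal.
journal_path = os.path.join(tempfile.mkdtemp(), "events.journal")
before = JournaledQueue(journal_path)
before.put("e1", {"action": "blog_post"})
before.put("e2", {"action": "document"})
before.mark_done("e1")  # e1 reached the backend before the restart
after_restart = JournaledQueue(journal_path)  # e2 is recovered from disk
```

A production version would also need to handle a partially written last line and to compact the journal over time. Both are left out to keep the sketch short.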

Around 4pm CET on Monday all access from Tor was blocked, which stabilized the system and brought it back to expected performance levels. At around noon on Tuesday, the firewall adjustments became active. From then on, the filled-up queues on several app servers needed a couple of hours to be processed, which is why some activities and point assignments took very long to show up.

Aftermath and Actions

On a certain path in the processing chain, actions are stored only in memory, and due to the server restarts these events were unfortunately lost. Any data loss or impact on data integrity at a larger scale is unacceptable, and I’m sorry that this happened. We have taken actions to ensure this won’t happen again.

The fact that a blocking API call has any effect on the event queue at all is being analyzed and needs to be fixed, so that we won’t see this issue again. It is unclear whether this is due to our custom integration with SAP Cloud ID or a shortcoming of the underlying platform, but for now I have to blame our customization.

We also consider the Tor network a valuable service and don’t want to see it blocked long term. We checked Tor traffic from before the attack and saw some regular usage. To enable Tor access again, we are looking into implementing a request blocker that is activated only when traffic hits a certain threshold, so that regular Tor users will still be able to use SCN.
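A threshold-activated blocker of this kind could look roughly like the Python sketch below. The class name, the parameters, and the choice of a sliding time window are assumptions for illustration, not the design the team actually chose. Every request from the monitored traffic class is counted, and requests are rejected only while the count inside the window exceeds the threshold.

```python
from collections import deque


class ThresholdBlocker:
    """Reject requests only while traffic in a sliding window exceeds a limit."""

    def __init__(self, max_requests, window_s):
        self.max_requests = max_requests
        self.window_s = window_s
        self._stamps = deque()  # arrival times of recent requests

    def allow(self, now):
        """Record a request arriving at time `now` (seconds) and decide on it."""
        self._stamps.append(now)
        # Forget requests that have fallen out of the window.
        while now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        # Rejected requests are counted too, so a sustained flood stays blocked.
        return len(self._stamps) <= self.max_requests


# Demo: at most 3 requests per 10 seconds; explicit timestamps keep it repeatable.
blocker = ThresholdBlocker(max_requests=3, window_s=10)
decisions = [blocker.allow(t) for t in (0, 1, 2, 3, 4, 20)]
# decisions == [True, True, True, False, False, True]
```

Passing the time in explicitly keeps the example deterministic; a real deployment would read a monotonic clock instead. Regular Tor traffic below the threshold passes untouched, and blocking stops on its own once the flood subsides.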

Please also have a look at the post by Audrey Stevenson on this topic: Recent Issue with SCN Not Recording Point Gains and Losses


      22 Comments
      Matthias Steiner

      That's what I would call a prime example of transparent communication! Yeap, sometimes things hit the fan and get ugly... #shithappens - however, what's more important is how you react and what you do to ensure it won't happen a second time!!!

      Kudos to Oliver & team!

      Susan Keohan

      Hi Oliver,

      I second what Matthias has said - stuff happens, and dealing with it, and reporting to your stakeholders in a clear manner are the key actions that we all know need to be taken.

      So, as a stakeholder, I thank you. 

      Sue

      Oliver Kohl
      Blog Post Author

      Thanks! Happy to have you as a stakeholder.

      Susan Keohan

      Wait, am I a stakeholder or just a pesky user?  😆

      Steffi Warnecke

      The created content wasn't lost and you guys are working on a fix, so that won't happen again. What more can I want?

      Thank you for the transparency now and the fast reactions during this issue. That's a great way of handling this. 🙂

      Ludek Uher

      Beautiful explanation. We can only wish that everyone was as clear and transparent.

      Many thanks to you Oliver and the whole team!

      - Ludek

      Oliver Kohl
      Blog Post Author

      The irony in this is that I probably reach Topaz level because of this blog post...

      Steffi Warnecke

      Good thing, the points counting stuff is working again then. 😀

      Colleen Hebbert

      Hi Oliver

      mmm... how many system issues to make Emerald?

      Topaz is deserved for transparency. Mistakes happen (every one of us on SAP has probably contributed to some form of incident for someone... sad to say), but without root cause analysis and taking ownership of them there is little chance of avoiding them in future. Glad to hear it'll get fixed.

      Pity you had multiple issues contributing to this but major incidents like these usually have more than 1 contributing point of failure.

      Regards

      Colleen

      Oliver Kohl
      Blog Post Author

      Thanks Colleen! I hope to have some better stuff to talk about pretty soon.

      Jason Lax

      You did. 😎

      Former Member

      Hi Oliver and team!

      Thank you for the transparency and care in communicating the problems that occurred to everyone.

      Thank you!

      Karen Rodrigues

      Michael Appleby

      Is this related to the problem which started on January 7th or perhaps 6th?  Re: Reading not recognized by system

      Thanks, Mike

      SAP P&I Technology RIG

      Oliver Kohl
      Blog Post Author

      Yep, same problem Mike.

      Michael Appleby

      Now I get to be a good poster and award points!

      Thanks Oliver!

      Jürgen Hartwig

      Unauthorized

      Access to this place or content is restricted. If you think this is a mistake, please contact your administrator or the person who directed you here.

      Jürgen L

      You are not a moderator, hence you cannot follow this link as it goes to the moderator space.

      Jelena Perfiljeva

      Hmm, I thought this would be about binary system... 🙂

      Thank you for sharing, Oliver, your blogs are always a pleasure to read. Combining open communication with a learning experience - it doesn't get any better than this!

      Jürgen L

      Is it happening again right now?

      Dibyendu Patra

      Yes. From yesterday night.

      Christopher Solomon

      You danged ol' script kiddies! Get off my lawn!!!! *shakes cane vigorously* 😛

      Michael Appleby

      And it is still going on right now.

      Cheers, Mike

      SAP Technology RIG