On Monday (12th of January) some of you saw the platform behaving in a strange way, with points not being given for actions during a certain time frame and overall slow performance of the platform. This is the first time we have seen this on a large scale on the new platform and this is a serious incident. Any form of data loss or disruption of data integrity is unacceptable, which is why I want to share with you some background on why this happened and what action we have taken.

Issues like this usually don’t arise from regular operations. In this case, two unusual situations came together and led to the incident.

SAP Cloud ID Integration

SCN uses SAP Cloud ID (formerly SAP ID Service) for SSO and user profile information. SAP Cloud ID is the master for all profile information and any change that you perform in the SCN user profile gets synced back to it.

Last week SAP Cloud ID moved to a new data center infrastructure, with zero downtime or disruption for SAP Cloud ID itself. As a side effect of this move, some specific rules on our hardware firewall kicked in and blocked every API call from SCN into SAP Cloud ID. One result was that profile changes could no longer be synced back from SCN correctly, which is unfortunate but from my perspective not a big deal. Each SCN profile gets synced with SAP Cloud ID at login, so the worst case is that some changes got overwritten at the next login.

Another result of these API calls being blocked was that the platform-internal event queue on each app server got locked while waiting for the API call, which led to the event queues filling up. You noticed this as the read state on the Communications page not updating in time, which was reported in the support forums (Notifications icon for “Communications” stream not updating (again)). While we were working on identifying the root cause of this issue, another incident happened.
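To illustrate the failure mode, here is a minimal Python sketch of a queue consumer that puts an upper bound on how long it waits for a remote call and sets timed-out events aside instead of stalling everything behind them. This is not the platform's actual code: `drain_events`, `fake_remote_call`, the event names and the 2-second timeout are all assumptions made up for the example.

```python
from collections import deque


def drain_events(events, remote_call, timeout_s=2.0):
    """Process events in order; never wait longer than timeout_s on one call.

    Events whose remote call times out are set aside for a later retry
    instead of blocking everything queued behind them.
    """
    pending = deque(events)
    processed, deferred = [], []
    while pending:
        event = pending.popleft()
        try:
            remote_call(event, timeout=timeout_s)
            processed.append(event)
        except TimeoutError:
            deferred.append(event)
    return processed, deferred


def fake_remote_call(event, timeout):
    """Hypothetical stand-in: profile syncs time out, everything else succeeds."""
    if event.startswith("profile_sync"):
        raise TimeoutError(f"no answer within {timeout}s")
    return "ok"


if __name__ == "__main__":
    events = ["profile_sync:alice", "blog_created:bob", "points_awarded:bob"]
    processed, deferred = drain_events(events, fake_remote_call)
    print("processed:", processed)
    print("deferred:", deferred)
```

The point of the sketch is the shape of the loop: one unreachable backend costs at most one timeout per affected event, and unrelated events keep flowing.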

DDoS from Tor Servers

For the first time on the new platform we saw a planned DDoS attack on our infrastructure. Someone was using the Tor anonymity network to send a large number of requests for forum posts, with the intention (I have to assume) of bringing down our platform. The attack lasted from Sunday the 11th until Monday the 12th, when we were able to block the requests. During the attack we saw incoming requests rise to up to 30 times our normal load. As a result, our app server thread pools filled up to the point that no more threads were available on most app servers, and some users got timeouts when accessing SCN. Tor provides a full list of the IP addresses of all exit nodes, which helped us block these requests before they hit the app servers.

SCN_Load_by_Tor.png
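As a rough sketch of how such a list can be turned into a block rule, the Python snippet below parses a plain-text list into a set and checks client addresses against it. It assumes a format of one IP address per line with optional `#` comments, which the post does not specify. The function names and the sample addresses (taken from the reserved documentation ranges) are hypothetical.

```python
import ipaddress


def parse_exit_list(text):
    """Parse a plain-text list (assumed: one IP address per line) into a set.

    Blank lines and lines starting with '#' are skipped; malformed
    lines are ignored rather than aborting the whole import.
    """
    exits = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        try:
            exits.add(ipaddress.ip_address(line))
        except ValueError:
            continue
    return exits


def is_blocked(client_ip, exits):
    """Return True if the client address is on the parsed exit list."""
    return ipaddress.ip_address(client_ip) in exits


if __name__ == "__main__":
    sample = "# sample list\n192.0.2.10\n\nnot-an-ip\n198.51.100.7\n"
    exits = parse_exit_list(sample)
    print(is_blocked("192.0.2.10", exits))
    print(is_blocked("203.0.113.5", exits))
```

In practice a rule like this would sit in front of the app servers (firewall or load balancer) so that blocked requests never consume an app server thread.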

While keeping the service online under the load of the DDoS attack, our operations team also had to restart a couple of the app servers. What we didn’t realize at this point was that the in-memory event queues were filling up due to the blocking API call into SAP Cloud ID, and that an app server restart would flush these into /dev/null. As a result, already executed (trans)actions from that time frame, like creating a blog post or document, neither showed up in the personal activities nor got transferred to the Gamification backend for reputation calculation.

Around 4pm CET on Monday all access from Tor got blocked, which stabilized the system and brought it back to expected performance levels. At around noon on Tuesday the firewall adjustments became active, and from then on the filled-up queues on several app servers needed a couple of hours to get processed, which is why some activities and point assignments took very long to show up.

Aftermath and Actions

For a certain path in the processing chain, actions are stored only in memory, and due to the server restarts these events were unfortunately lost. Any data loss or impact on data integrity at a larger scale is unacceptable, and I’m sorry that this happened. We have taken actions to ensure this won’t happen again.
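One common way to protect queued events against a restart is to append each event to a file before accepting it into memory, and to replay that file on startup. The Python sketch below shows the idea only. It is not a description of the actions actually taken, and `JournaledQueue`, the file name and the event fields are hypothetical. It also leaves out removing events from the journal once they have been processed.

```python
import json
import os
import tempfile


class JournaledQueue:
    """Event queue that appends each event to a file before accepting it."""

    def __init__(self, path):
        self.path = path
        # On startup, reload whatever an earlier process had journaled.
        self.pending = self._replay()

    def _replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def put(self, event):
        # Write to disk first, so a restart cannot lose an accepted event.
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
            fh.flush()
            os.fsync(fh.fileno())
        self.pending.append(event)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "events.log")
    queue = JournaledQueue(path)
    queue.put({"type": "blog_created", "user": "bob"})
    queue.put({"type": "points_awarded", "user": "bob", "points": 10})

    # Simulate a restart: a fresh object sees the same pending events.
    restarted = JournaledQueue(path)
    print(restarted.pending)
```

The trade-off is an extra disk write per event, in exchange for surviving a process restart with the backlog intact.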

The fact that a blocking API call has any effect on the event queue at all is being analyzed and needs to be fixed, so that we won’t see this issue again. It is unclear whether this is due to our custom integration with SAP Cloud ID or a shortcoming of the underlying platform, but for now I have to blame our customization.

We also consider the Tor network a valuable service and don’t want to see it blocked long term. We checked Tor traffic before the attack and saw some regular usage. In order to enable Tor access again, we are looking into implementing a request blocker that only gets activated when a certain threshold is hit, so that regular Tor users will still be able to use SCN.
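A threshold-activated blocker of this kind could look roughly like the Python sketch below. It is an illustration under assumptions, not the planned implementation: the post does not say what the threshold measures, so the sketch assumes it is the number of requests from listed addresses within a sliding time window. `ThresholdBlocker` and its parameters are hypothetical names.

```python
import time
from collections import deque


class ThresholdBlocker:
    """Block listed clients only while their recent request rate is too high."""

    def __init__(self, listed_ips, max_requests, window_s, clock=time.monotonic):
        self.listed_ips = set(listed_ips)
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock
        self.recent = deque()  # timestamps of recent requests from listed IPs

    def allow(self, client_ip):
        # Clients that are not on the list are never affected.
        if client_ip not in self.listed_ips:
            return True
        now = self.clock()
        # Drop timestamps that have fallen out of the sliding window.
        while self.recent and now - self.recent[0] > self.window_s:
            self.recent.popleft()
        self.recent.append(now)
        return len(self.recent) <= self.max_requests


if __name__ == "__main__":
    blocker = ThresholdBlocker({"192.0.2.10"}, max_requests=3, window_s=10)
    print([blocker.allow("192.0.2.10") for _ in range(5)])
```

Under normal load the listed addresses pass through untouched; only while the window holds more than `max_requests` entries are they rejected, and access resumes by itself once the traffic subsides.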

Please also have a look at the post by Audrey Stevenson on this topic: Recent Issue with SCN Not Recording Point Gains and Losses


22 Comments


  1. Matthias Steiner

    That’s what I would call a prime example of transparent communication! Yeap, sometimes things hit the fan and get ugly… #shithappens – however, what’s more important is how you react and what you do to ensure it won’t happen a second time!!!

    Kudos to Oliver & team!

  2. Susan Keohan

    Hi Oliver,

    I second what Matthias has said – stuff happens, and dealing with it, and reporting to your stakeholders in a clear manner are the key actions that we all know need to be taken.

    So, as a stakeholder, I thank you. 

    Sue

  3. Steffi Warnecke

    The created content wasn’t lost and you guys are working on a fix, so that won’t happen again. What more can I want?

    Thank you for the transparency now and the fast reactions during this issue. That’s a great way of handling this. 🙂

  4. Ludek Uher

    Beautiful explanation. We can only wish that everyone was as clear and transparent.

    Many thanks to you Oliver and the whole team!

    – Ludek

    1. Colleen Hebbert

      Hi Oliver

      mmm… how many system issues to make Emerald?

      Topaz is deserved for transparency. Mistakes happen (every one of us on SAP has probably contributed to some form of incident for someone, sad to say), but without root cause analysis and taking ownership of them there is little chance of avoiding them in future. Glad to hear it’ll get fixed.

      Pity you had multiple issues contributing to this but major incidents like these usually have more than 1 contributing point of failure.

      Regards

      Colleen

  5. Jelena Perfiljeva

    Hmm, I thought this would be about binary system… 🙂

    Thank you for sharing, Oliver, your blogs are always pleasure to read. Combining open communication with learning experience – it doesn’t get any better than this!

