1 + 1 = less than you expected
On Monday (January 12th), some of you saw the platform behaving in a strange way: points were not given for actions during a certain time frame, and the overall performance of the platform was slow. This is the first time we have seen something like this on a large scale on the new platform, and it is a serious incident. Any form of data loss or disruption of data integrity is unacceptable, which is why I want to share with you some background on why this happened and what actions we have taken.
Issues like this usually don't arise from regular operations, and in this case two unusual situations came together, which led to the incident.
SAP Cloud ID Integration
SCN uses SAP Cloud ID (formerly SAP ID Service) for SSO and user profile information. SAP Cloud ID is the master for all profile information and any change that you perform in the SCN user profile gets synced back to it.
Last week SAP Cloud ID moved to a new data center infrastructure, with zero downtime or disruption for SAP Cloud ID itself. Due to this move, some specific rules on our hardware firewall kicked in and prevented any API call from SCN into SAP Cloud ID. One result was that profile changes could no longer be synced back from SCN correctly, which is unfortunate but, from my perspective, not a big deal. Each SCN profile gets synced with SAP Cloud ID at login, so the worst case is that some changes got overwritten at the next login.
Another result of these API calls getting blocked was that the platform-internal event queue on each app server got locked while waiting for the API call, which led to the event queues filling up. You noticed this by seeing the read state on the Communications page not updating in time, which got noted in the support forums (Notifications icon for “Communications” stream not updating (again)). While we were working on identifying the root cause of this issue, another incident happened.
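To illustrate the mechanism, here is a minimal sketch, not SCN's actual code, of how a single consumer thread stuck on a synchronous call lets a bounded in-memory event queue fill up. All class and method names, and the queue size, are made up for illustration:

```java
import java.util.concurrent.*;

public class BlockedQueueDemo {

    public static void main(String[] args) throws Exception {
        // Bounded in-memory queue, like an app server holding pending events.
        BlockingQueue<String> events = new LinkedBlockingQueue<>(100);

        // Single consumer: takes events one at a time and calls the identity
        // service synchronously for each of them.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String event = events.take();
                    syncProfileToIdentityService(event); // hangs once firewalled
                }
            } catch (InterruptedException e) {
                // shutdown requested
            }
        });
        consumer.start();

        // Producer: user actions keep arriving while the consumer is stuck,
        // so the queue fills to capacity and further events are rejected.
        for (int i = 0; i < 120; i++) {
            if (!events.offer("event-" + i)) {
                System.out.println("queue full, dropped event-" + i);
            }
        }
        System.out.println("events stuck in the queue: " + events.size());
        consumer.interrupt();
    }

    // Simulates an API call whose packets are silently dropped by a firewall:
    // the caller just waits forever for a response that never comes.
    static void syncProfileToIdentityService(String event) throws InterruptedException {
        Thread.sleep(Long.MAX_VALUE);
    }
}
```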
DDoS from Tor Servers
For the first time on the new platform, we saw a planned DDoS attack on our infrastructure. Someone was using the Tor anonymity network to send a high volume of requests for forum posts, with the intention (I have to assume) of bringing down our platform. The attack lasted from Sunday the 11th until Monday the 12th, when we were able to block the requests. During the attack we saw an increase in incoming requests of up to 30 times our normal load, and as a result our app server thread pools filled up to the point that no more threads were available on most app servers, which meant some users got timeouts when accessing SCN. Tor provides a full list of the IP addresses of all exit nodes, which helped us block those requests from hitting the app servers.
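As background on how such a block can work: the Tor Project publishes the addresses of its exit nodes, so a service can match incoming IPs against that list. Below is a rough sketch of the idea, not our actual firewall configuration; the URL is the Tor Project's current public bulk exit list endpoint, and in a real setup the list would be fed into the firewall rather than checked in application code:

```java
import java.io.*;
import java.net.*;
import java.util.*;

public class TorExitBlocklist {

    public static void main(String[] args) throws IOException {
        Set<String> exitNodes = fetchExitNodes();
        String clientIp = args.length > 0 ? args[0] : "203.0.113.42";
        if (exitNodes.contains(clientIp)) {
            System.out.println(clientIp + " is a Tor exit node -> block");
        } else {
            System.out.println(clientIp + " is not on the exit list -> allow");
        }
    }

    // Downloads the published list of Tor exit node IP addresses.
    static Set<String> fetchExitNodes() throws IOException {
        Set<String> ips = new HashSet<>();
        URL url = new URL("https://check.torproject.org/torbulkexitlist");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty() && !line.startsWith("#")) {
                    ips.add(line);
                }
            }
        }
        return ips;
    }
}
```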
While keeping the service online under the load from the DDoS attack, our operations team also had to restart a couple of the app servers. What we didn't realize at this point was that the in-memory event queues were filling up due to the blocking API call into SAP Cloud ID, and that an app server restart would flush these into /dev/null. As a result, already executed (trans)actions from that time frame, like creating a blog post or document, neither show up in the personal activities nor got transferred to the Gamification backend for reputation calculation.
Around 4pm CET on Monday, all access from Tor got blocked, which stabilized the system and brought it back to expected performance levels. Around noon on Tuesday, the firewall adjustments became active, and from there on the filled-up queues on several app servers needed a couple of hours to get processed, which is why some activities and point assignments took very long to show up.
Aftermath and Actions
For a certain path in the processing chain, actions are only stored in memory, and due to the server restarts these events were unfortunately lost. Any data loss or impact on data integrity at a larger scale is unacceptable, and I'm sorry that this happened. We have taken actions to ensure this won't happen again.
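One common safeguard against exactly this failure mode, sketched below purely as an illustration and not as a description of the fix we implemented, is to journal each event to durable storage before enqueueing it, so that a restart replays pending events instead of discarding them. All names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.*;

public class DurableEventQueue {

    private final Path journal;
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    DurableEventQueue(Path journal) throws IOException {
        this.journal = journal;
        // On startup, replay events that were journaled but never processed.
        if (Files.exists(journal)) {
            for (String line : Files.readAllLines(journal)) {
                queue.add(line);
            }
        }
    }

    void publish(String event) throws IOException {
        // Append to the journal first: if the server dies now, the event survives.
        Files.writeString(journal, event + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        queue.add(event);
        // A real implementation would also remove events from the journal
        // once the downstream consumer has acknowledged them.
    }

    public static void main(String[] args) throws IOException {
        Path journal = Files.createTempFile("events", ".log");
        DurableEventQueue q = new DurableEventQueue(journal);
        q.publish("blog-post-created");
        q.publish("document-created");

        // Simulate a restart: a fresh instance rebuilt from the same journal
        // still holds both events instead of flushing them to /dev/null.
        DurableEventQueue afterRestart = new DurableEventQueue(journal);
        System.out.println("events after restart: " + afterRestart.queue.size());
    }
}
```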
The fact that a blocking API call has any effect on the event queue at all is being analyzed and needs to be fixed, so that we won't see this issue anymore. It is unclear whether this is due to our custom integration with SAP Cloud ID or a shortcoming of the underlying platform, but for now I have to blame our customization.
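Independent of where the root cause lies, a typical mitigation is to bound how long the event consumer may wait on the remote call. A minimal sketch, assuming a 5-second budget is acceptable and with all names being illustrative:

```java
import java.util.concurrent.*;

public class TimeoutGuard {

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Run the remote call on a separate thread so we can bound the wait.
        Future<Void> call = pool.submit(() -> {
            syncProfileToIdentityService(); // pretend this hangs on a dropped connection
            return null;
        });

        try {
            call.get(5, TimeUnit.SECONDS); // fail fast instead of blocking the queue
        } catch (TimeoutException e) {
            call.cancel(true); // give up; a real system would retry or dead-letter the event
            System.out.println("sync timed out -> the event queue keeps draining");
        } finally {
            pool.shutdownNow();
        }
    }

    // Simulated hanging call to the identity service.
    static void syncProfileToIdentityService() throws InterruptedException {
        Thread.sleep(Long.MAX_VALUE);
    }
}
```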
We also consider the Tor network a valuable service and don't like to see it blocked long term. We checked Tor traffic from before the attack and saw some regular usage. In order to enable Tor access again, we are looking into implementing a request blocker that only gets activated when a certain threshold is hit, so that regular Tor users will still be able to use SCN.
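To make the idea concrete, here is a toy sketch of such a threshold-activated gate, a fixed-window rate limiter over all Tor-originated requests; the window size and limit are invented numbers, not our actual configuration:

```java
public class ThresholdBlocker {

    private final int maxRequestsPerWindow;
    private final long windowMillis;
    private int count = 0;
    private long windowStart = System.currentTimeMillis();

    ThresholdBlocker(int maxRequestsPerWindow, long windowMillis) {
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Returns true if this request should be allowed through. */
    synchronized boolean allow() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {
            windowStart = now; // new window: start counting from zero again
            count = 0;
        }
        return ++count <= maxRequestsPerWindow;
    }

    public static void main(String[] args) {
        // Toy setup: allow at most 100 Tor-originated requests per second.
        ThresholdBlocker torGate = new ThresholdBlocker(100, 1_000);
        int blocked = 0;
        for (int i = 0; i < 3_000; i++) { // simulated burst, DDoS-style
            if (!torGate.allow()) {
                blocked++;
            }
        }
        System.out.println("blocked " + blocked + " of 3000 burst requests");
    }
}
```

Regular Tor users would stay well under such a limit, so the gate would only bite during an attack-scale burst.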
Please also have a look at the post by Audrey Stevenson on this topic: Recent Issue with SCN Not Recording Point Gains and Losses
That's what I would call a prime example of transparent communication! Yeap, sometimes things hit the fan and get ugly... #shithappens - however, what's more important is how you react and what you do to ensure it won't happen a second time!!!
Kudos to Oliver & team!
Hi Oliver,
I second what Matthias has said - stuff happens, and dealing with it, and reporting to your stakeholders in a clear manner are the key actions that we all know need to be taken.
So, as a stakeholder, I thank you.
Sue
Thanks! Happy to have you as a stakeholder.
Wait, am I a stakeholder or just a pesky user? 😆
The created content wasn't lost and you guys are working on a fix, so that won't happen again. What more can I want?
Thank you for the transparency now and the fast reactions during this issue. That's a great way of handling this. 🙂
Beautiful explanation. We can only wish that everyone was as clear and transparent.
Many thanks to you Oliver and the whole team!
- Ludek
The irony in this is that I probably reach Topaz level because of this blog post...
Good thing, the points counting stuff is working again then. 😀
Hi Oliver
mmm... how many system issues to make Emerald?
Topaz is deserved for transparency. Mistakes happen (every one of us on SAP has probably contributed to some form of incident for someone... sad to say) but without root cause analysis and taking ownership of them there is little chance of avoiding them again in future. Glad to hear it'll get fixed.
Pity you had multiple issues contributing to this, but major incidents like these usually have more than one contributing point of failure.
Regards
Colleen
Thanks Colleen! I hope to have some better stuff to talk about pretty soon.
You did. 😎
Hi Oliver and team!
Thanks for the transparency and concern in communicating to everyone the problems that occurred.
Thank you!
Karen Rodrigues
Is this related to the problem which started on January 7th or perhaps 6th?
Thanks, Mike
SAP P&I Technology RIG
Yep, same problem, Mike.
Now I get to be a good poster and award points!
Thanks Oliver!
Unauthorized
You are not a moderator, hence you cannot follow this link as it goes to the moderator space.
Hmm, I thought this would be about the binary system... 🙂
Thank you for sharing, Oliver, your blogs are always a pleasure to read. Combining open communication with a learning experience - it doesn't get any better than this!
Is it happening again right now?
Yes. Since yesterday night.
You danged ol' script kiddies! Get off my lawn!!!! *shakes cane vigorously* 😛
And it is still going on right now.
Cheers, Mike
SAP Technology RIG