Everything Is (Mostly) Under Control

oliver · ‎03-23-2012

In my years of consulting work for large and small customers, I've come to the conclusion that being as transparent as possible is usually the right way to approach all parts of a project. This is true not only with your colleagues at work but also when talking to the customer. Speaking openly about the problems ahead and being transparent has always been the right thing to do.

I had the same philosophy in mind when I wrote my blog post last week, Anatomy of a Go-Live, trying to give you a comprehensive overview of the state of the migration and what kept us busy in the days before. With SCN having its roots as a network for developers, my guess was that a look into the technical side of the project and its challenges would help you better understand how to interpret the then-ongoing platform issues and how to deal with them. What I did not expect was the overwhelmingly positive response to this post and all the amazing feedback it got. Thanks a lot for the incredible support and all the positive comments!

Enough on that, back to work...

Performance & Stability

What you all experienced most was the less than top-notch perceived performance of the system, and unfortunately it took us some time to find the right knobs to turn to get it working. After disabling the troublesome background task on Wednesday last week, we still saw a very high load on the DB over the days that followed. It came from a multitude of very similar but unique queries getting data from a single DB table. The table stores a small set of information about all content objects (documents, blog posts, and discussion threads) in SCN in order to improve lookup speed when using the browse content tab in each topic space, like this one: http://scn.sap.com/community/about/content. The really expensive queries would ask for data in a very high LIMIT range, like 10,000 and upwards, and putting that in relation to the pagination functionality it was hard to imagine which type of sane end user would page through the content set that far. Also, with the number of queries of this type that we got, the suspect was again some other task going rogue in the background. It took us until Wednesday this week to finally figure out what it was.
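To make the cost of such high LIMIT ranges more tangible, here is a minimal sketch in Python using an in-memory SQLite database. The table name `content_index`, its columns, and the row counts are assumptions for illustration only and do not reflect the real SCN schema or database. With a large OFFSET, the database has to step over all skipped rows before it can return the requested page. A keyset query (continuing after the last seen ID) is shown purely for comparison as a commonly used alternative; it is not a description of how the platform was fixed.

```python
import sqlite3


def build_db(n_rows: int) -> sqlite3.Connection:
    """Create a hypothetical lookup table with n_rows content objects."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE content_index ("
        "id INTEGER PRIMARY KEY, space_id INTEGER, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO content_index (id, space_id, title) VALUES (?, ?, ?)",
        ((i, i % 5, f"item {i}") for i in range(1, n_rows + 1)),
    )
    conn.commit()
    return conn


def page_by_offset(conn, space_id, offset, page_size):
    """Deep paging: the DB walks past `offset` matching rows, then returns a page."""
    rows = conn.execute(
        "SELECT id FROM content_index WHERE space_id = ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (space_id, page_size, offset),
    ).fetchall()
    return [r[0] for r in rows]


def page_by_keyset(conn, space_id, last_seen_id, page_size):
    """Comparison only: continue after the last ID instead of counting rows."""
    rows = conn.execute(
        "SELECT id FROM content_index WHERE space_id = ? AND id > ? "
        "ORDER BY id LIMIT ?",
        (space_id, last_seen_id, page_size),
    ).fetchall()
    return [r[0] for r in rows]
```

Both functions return the same page when `last_seen_id` is the ID of the last row before the offset, but the offset variant does more work the deeper the page is, which is why a flood of such queries hurts.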

The browse content tab allows you to filter the result set by a number of different inputs, ranging from object type (blogs, documents, etc.) to tags and a text filter. If you check for yourself on the publishing date of this post, you will find one is missing. We had to temporarily remove the "Filter by Text" option. The way it works is that once you enter any arbitrary text, it starts fetching the first thousand or so entries based on the other filters, hands their IDs over to the internal Solr index and queries them for the text content. In case it does not find enough results to populate the browse page, it repeats the procedure as long as necessary and only comes back once either enough results have been found or nothing is left to query. Can you imagine dozens of users using the text filter to search for content in their spaces of interest?
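The loop described above can be sketched roughly as follows. This is a simplified Python model, not the vendor's actual implementation: `filter_by_text`, the batch size of 1000, and the `matches_text` callable (a stand-in for the Solr lookup restricted to a set of IDs) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def filter_by_text(
    candidate_ids: Sequence[int],
    matches_text: Callable[[Sequence[int]], List[int]],
    page_size: int,
    batch_size: int = 1000,
) -> Tuple[List[int], int]:
    """Return (results for one browse page, number of batches fetched).

    candidate_ids stands for the rows selected by the other filters;
    each slice models one DB query with an ever higher offset.
    """
    results: List[int] = []
    batches = 0
    offset = 0
    # Keep going until the page is full or nothing is left to query.
    while len(results) < page_size and offset < len(candidate_ids):
        batch = candidate_ids[offset:offset + batch_size]
        batches += 1
        results.extend(matches_text(batch))  # text lookup limited to these IDs
        offset += batch_size
    return results[:page_size], batches


# Hypothetical example: 20,000 candidate objects in a topic space.
ids = list(range(1, 20001))
rare = lambda batch: [i for i in batch if i % 5000 == 0]   # almost no hits
common = lambda batch: [i for i in batch if i % 2 == 0]    # many hits

rare_results, rare_batches = filter_by_text(ids, rare, page_size=20)
common_results, common_batches = filter_by_text(ids, common, page_size=20)
```

In this model a common search term fills the page from the first batch, while a rare term forces the loop through all 20 batches, each one hitting the table with a higher offset than the one before.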

We are currently working with the vendor support to get a schema architecture in place that is sufficient for the volume of our community. In our case this specific DB table holds roughly 2.1 million entries as of today, and it is growing constantly with each new discussion thread created and each new blog post published. We are very much aware of the importance of the "Filter by Text" functionality and we are committed to bringing it back as soon as possible.

All of this wouldn't have been possible without a tremendous team in place and I want to point out a couple of people responsible for getting it done.

There is our operations guy valentin.weckerle. He was constantly investigating, testing new query combinations, reconfiguring the DB and documenting all combinations to get closer to the solution step by step. He deserves a big pat on the back for all the long hours without running out of patience.

shmuel.krakower is a load testing expert who takes care of getting all the test scripts in place so that our testing environment reflects all the diversity and complexity of the production usage patterns.

And last but not least elad.rosenheim, who joined our team full time just a couple of weeks ago. His deep understanding of JEE application architecture and his constant nagging to question our conclusions were a big contribution to getting where we are.

Cause and Effect

Prior to Wednesday evening you still sometimes experienced slow performance, occasional logouts, and errors while using the content browser. Usually this would vary depending on the time of day and how many users were on the system concurrently. Actually, Wednesday afternoon this week was probably our worst day since launch: we saw stalling DB connections because the whole connection pool was in use while the running queries competed for CPU and IO resources, resulting in query times of hundreds of seconds. It got so bad that we had to kill long-running read queries manually in order to get the DB to process the rest of the queued requests properly. But since applying the mitigation patch we haven't seen a situation even remotely close to this.
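How a handful of long-running queries can stall everything behind them is easy to see in a toy model. The following Python sketch assumes a fixed-size pool that serves requests strictly in arrival order; the pool size and all timings are made-up values for illustration, not measurements from our system.

```python
import heapq
from typing import List, Tuple


def simulate_pool(pool_size: int, queries: List[Tuple[float, float]]) -> List[float]:
    """Return how long each query waits for a free connection.

    queries is a list of (arrival_time, duration) in seconds, sorted by
    arrival time. Connections are handed out first come, first served.
    """
    # Min-heap holding the time at which each connection becomes free.
    free_at = [0.0] * pool_size
    heapq.heapify(free_at)
    waits = []
    for arrival, duration in queries:
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + duration)
    return waits


# Two 300-second queries occupy a (hypothetical) pool of two connections;
# the short queries arriving right after them have to wait the whole time.
stalled = simulate_pool(2, [(0, 300), (0, 300), (1, 0.05), (2, 0.05)])

# Same traffic, but the long queries end after 5 seconds (e.g. they are killed).
relieved = simulate_pool(2, [(0, 5), (0, 5), (1, 0.05), (2, 0.05)])
```

In the first run the two short queries wait 299 and 298 seconds although they need only a fraction of a second themselves; in the second run the wait drops to 4 and 3 seconds.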

From what we see this should now be under control. The DB load dropped dramatically and the application servers are much better able to respond quickly to all requests. One piece still missing is that the app servers are operating at an unhealthy upper limit memory-wise. This is because each of the application server nodes holds a redundant Solr index, which consumes a big proportion of the available heap memory. And you don't want to raise the heap memory too much, because then full GCs end up being massive operations for the JVM and you have the next problem at hand. The solution (that is at least the plan) is to move the Solr index into a single server instance at first and later replicate it on multiple nodes in an HA configuration to make it fail-safe. This is targeted to be implemented, tested and rolled out within the next week or two.

Migration Aftermath

A couple of words on the quality of the content migration. We hear complaints about the quality of the migrated blog posts. Some images are missing and the formatting, especially around code samples, is in bad shape. All content is still available, nothing is lost! We are working on getting the image issue fixed and we'll probably have something available next week. Unfortunately the old blog instance allowed any dirty HTML and CSS hackery, but the new system is much more restrictive and this content cleanup is best done manually. We still have all the HTML markup available on request, so if any content got lost please make us aware and we can provide you with the original content. I've set up the document Report Your Missing Blog Content where you can enter the URLs of your blog posts. I encourage you to please take care of your own content yourself. In case you are one of the top contributors and you're looking for support to help you in this tedious effort, please be aware that as long as a blog post stays in your personal blog, it can only be edited by yourself. Only once you move it into a topic space are moderators like myself or the SCN content team able to support you. The upside is that once we have finished a clean migration of all blog content, we will have it in a much better format and the next migration will be less problematic. Promise!

In case you have noticed some strange formatting, especially with line breaks in the old forum posts, we will take care of this once we find some spare cycles. We plan to have an automated task in place to take care of this, but I personally consider this much less of a problem because the content is otherwise still fine.

The Functional Side

With most of the pain slowly fading away, we have now also started looking into the functional bugs that we have identified. Besides our internal issue tracking system, we have also opened the SCN Known Issues List, where we add issues found by the community. You can report any issues you find in the (unfortunately quite active) SCN Bug Reporting [Read-only] space, where we constantly monitor your findings and answer any questions. I encourage you to send us any and all issues you see, but please also make sure that the issue hasn't been reported before by checking the SCN Known Issues List. With a unique ID in place for every identified issue, we are also able to let you know which ones have been fixed and deployed to the platform. For this we have created the SCN Release Notes, where we post updates with each release and the changes which have been implemented. You will not only see bugs in there but also mitigation activities like the removed "Filter by Text" control. This is all part of the transparency we want to bring to SCN in order to keep you up to date on what improvements we bring to this exciting new community platform in the coming months and years.

Thanks for reading this long blog post and enjoy your weekend.

Oliver