Anatomy of a Go-Live
I was originally planning to write my first blog post on the new SCN about its platform architecture, the integration aspects of the landscape, our migration approach and the delivery model. But given the events of the last few days, I came to the conclusion that it would be better to first give you some background on what happened over the past five days.
Sunday
The production system has been live since the middle of December, and a small group of beta testers (~1000 SAP Mentors, Moderators and SAP employees) had been invited to get a first insight and provide feedback. During that time the focus was mostly on the overall performance of the system and getting bugs fixed. Twice a week we were deploying to the platform and running a delta migration of the latest blog and forum posts published on the old platform. The idea was to have lots of regular small go-lives every week in order to test the process and avoid any surprises for the actual go-live weekend.
The last delta migration ran this Sunday, and aside from some tasks specific to this weekend, we had done the whole procedure many times before. The final step was to enable the content redirect from the old platform and to enable the contributor permissions for the community members. Around 5 pm CET the core SAP IT team was in our connect room and we made the switch. I didn’t expect any big surprises, but it was still a very exciting moment to see the first community members exploring the site. We did the basic checks on the system, watching the traffic and load on the app and DB servers and checking the logs, but in the end it was still Sunday.
We knew it would be the next morning when things would get interesting. We all headed home to get some sleep and be fit for the real traffic.
Monday
And it got interesting. The Indian part of the community had already been on the system for about 3-4 hours, and with increasing load we saw the servers going down. This was unexpected, because we had done very extensive load tests before. We knew what traffic to expect, and our load tests had shown that the system was able to handle even peak situations. What was going wrong?
We got heap dumps of the dying app servers and quickly found out that a certain widget, which was integrated in basically every content page, was the troublemaker. The so-called MoreLikeThis widget shows a list of content similar to the one currently displayed. In order to identify the similar content objects, expensive queries against the metadata of the page were necessary, and this was eating up memory and CPU resources required for normal site operations. Because the widget is integrated via AJAX, these queries were not triggered during our load tests, and since the call was not considered typical end-user activity, we had not covered it explicitly either.
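To give a rough idea of what was missing from the test scenarios, here is a minimal sketch of how such an AJAX call could be hammered in a load scenario. The endpoint, parameter names and numbers are invented purely for illustration; our actual load tests are JMeter scenarios, not hand-written code.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch: fire the widget's AJAX call concurrently, the way browsers do on every content page. */
public class MoreLikeThisLoadSketch {

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(50); // roughly the parallelism we saw in production
        for (int i = 0; i < 1000; i++) {
            final long contentId = 100000 + i;
            pool.submit(new Runnable() {
                public void run() {
                    fetchMoreLikeThis(contentId);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }

    static void fetchMoreLikeThis(long contentId) {
        try {
            // Hypothetical AJAX endpoint of the widget - not the real URL
            URL url = new URL("https://scn.example.com/morelikethis?contentId=" + contentId);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("X-Requested-With", "XMLHttpRequest"); // mark it as an AJAX call
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(30000);
            System.out.println(contentId + " -> HTTP " + conn.getResponseCode());
        } catch (Exception e) {
            System.err.println(contentId + " failed: " + e.getMessage());
        }
    }
}
```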
Once we had identified the issue, and since the system was down anyway, we removed the widget and started the app servers again.
The traffic came back, but we did not see the system recovering very well. The DB was under permanent load, mostly due to specific long-running queries, which were easily identifiable as queries from the “Browse Content” tab of each space. Even more interesting was that probably 99% of the queries were unique, and these could not be properly cached by the DB. The content browse functionality is specifically designed for high load and large tables, in our case ~2.1 million content objects. But the high number of unique queries was what got the DB into trouble: it was not able to deliver the results quickly enough, and with 50 and more requests in parallel, the app servers were either waiting for query results or running into timeouts.
What was happening and why? We knew that we had to expect a very different usage pattern with the new platform. With all the social capabilities we were expecting a much higher amount of write accesses, even if these were just simple likes or ratings. We had all of this covered in our load tests. What we did not expect was that users would use the content browser for searching content, but this was exactly what happened. The content browser provides the ability to filter by content type (blog posts, discussions, documents etc.) but also by custom text entry. Although we had some variants as part of our test suite, we were not prepared for this high number of different variants, which is what made the actual difference performance-wise.
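To illustrate why this hurts the database so much, here is a minimal sketch of how a browse query that is assembled by string concatenation ends up unique for nearly every filter combination, so the DB has to parse and optimize each statement from scratch. The table, column and filter names are invented; this is not the platform's actual code.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: every filter combination produces a brand-new SQL text. */
public class BrowseQuerySketch {

    static String buildBrowseQuery(List<String> contentTypes, String textFilter) {
        StringBuilder sql = new StringBuilder("SELECT id, title, modified FROM content WHERE 1=1");
        if (!contentTypes.isEmpty()) {
            // every different selection of content types changes the SQL text
            sql.append(" AND content_type IN ('");
            for (int i = 0; i < contentTypes.size(); i++) {
                if (i > 0) {
                    sql.append("','");
                }
                sql.append(contentTypes.get(i));
            }
            sql.append("')");
        }
        if (textFilter != null && textFilter.trim().length() > 0) {
            // the free-text filter is inlined as a literal - unique per search term
            sql.append(" AND lower(title) LIKE '%").append(textFilter.toLowerCase()).append("%'");
        }
        sql.append(" ORDER BY modified DESC");
        return sql.toString();
    }

    public static void main(String[] args) {
        // Two users browsing with slightly different filters produce two distinct statements,
        // so nothing can be reused from the database's statement/plan cache.
        System.out.println(buildBrowseQuery(Arrays.asList("BLOG"), "hana"));
        System.out.println(buildBrowseQuery(Arrays.asList("BLOG", "DISCUSSION"), "abap"));
    }
}
```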
With the widget removed the app servers were doing OK, but due to the DB timeouts the actual transactions were failing, resulting in front-end errors and a very bad end-user experience.
The situation improved slightly as the near caches got better utilized, reducing the need to query the DB. The goal at this point was to keep the app up and running and monitor it to get better insight into what was going wrong. Even with these problems we saw high load on the system and very frequent contributions, often multiple per minute, while getting hit hard with read requests.
Tuesday
Overnight the system was very shaky and our great 24/7 support had to restart single app servers on a regular basis due to unresponsiveness and the resulting removal from the cluster. In the late morning the problems became more critical again, but not as bad as the day before. More requests could be served from the local or near caches, which gave the DB a bit of air to breathe.
But this was when the sign-out issues started to grow. We saw increasing feedback from the community about losing their sign-in sessions, of course paired with overall bad performance of the application. But every app server was able to keep up for multiple hours, and rolling restarts of dying servers allowed us to keep the cluster intact and serve the requests as much as possible.
The goal in this situation was basically to not make things worse, to allow the community to use the platform as much as possible, and in parallel to gather more information about the issues we were facing. The one thing we wanted to avoid at all costs was a longer downtime. You have to keep in mind that thousands of consultants, partners and employees rely each day on the information made available on SCN and the tremendous feedback of its regular contributors. Even though the overall experience was really bad, we still saw high traffic with both read-only requests and contributions. Members were already working on the system and posting to the discussion forums, posting questions and providing answers. Blog posts started to show up and the content team was working with the community on getting more content into the system.
We were looking into getting as much information as necessary to plan the next steps to mitigate the identified issues. The DB was still under unusually high load, and we investigated short-term remedies like adding server resources and table indices, but also considered upgrading the DB version. The app servers were losing their connection to the near caches before dying. With support from our platform vendor we considered moving from an HA setup with three cache servers to a single one, which sounds counterintuitive at first. In a three-cache-server setup, each app server has to write its updates to at least two caches, so by reducing it to only one cache server we would actually take load off the app servers. We were also considering restricting the browse functionality in the front end to reduce the number of unique queries.
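To make the counterintuitive part concrete, here is a minimal sketch of the write fan-out under the two configurations. The class and method names are invented for illustration and have nothing to do with the vendor's actual cache API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: each content update costs one remote write per required cache copy. */
public class CacheFanOutSketch {

    /** Stand-in for a remote cache server; in reality every put() is a network call. */
    static class CacheServer {
        private final Map<String, Object> store = new ConcurrentHashMap<String, Object>();
        void put(String key, Object value) {
            store.put(key, value);
        }
    }

    /** Write the update to the first 'copies' cache servers before returning. */
    static void writeUpdate(List<CacheServer> servers, int copies, String key, Object value) {
        int written = 0;
        for (CacheServer server : servers) {
            server.put(key, value);            // one remote round trip per copy
            if (++written >= copies) {
                break;
            }
        }
    }

    public static void main(String[] args) {
        List<CacheServer> haSetup = Arrays.asList(new CacheServer(), new CacheServer(), new CacheServer());
        List<CacheServer> singleSetup = Arrays.asList(new CacheServer());

        // HA setup: every app server pays for two remote writes per content update.
        writeUpdate(haSetup, 2, "content:42", "updated blog post");

        // Single cache server: one write per update - less work for the app servers,
        // at the price of losing the cache redundancy.
        writeUpdate(singleSetup, 1, "content:42", "updated blog post");
    }
}
```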
With a lot of opinions and assumptions at hand we started our load tests, this time with different DB and cache server configurations, in order to verify that the possible options would not get us into more trouble than we were already in.
In parallel we identified the most critical functional issues that we would be able to ship with any required hotfix. The issues had to be critical to the current user experience, but the fixes must not introduce any new problems, so only small fixes were considered.
In the late afternoon we had our regular call with the vendor support team to discuss our findings, share the information we had gathered and go over possible proposals. With their support located in the US, they were able to work on a patch which would allow us to configure the number of thread connections to the cache server, which was one of the bottlenecks we had identified.
Wednesday
Our 24/7 support team in Dresden took care of the operations side while the rest of the team in the European time zone got some sleep. App server restarts were again required during the night, but the last two days had allowed us to develop a pattern for keeping the system up and running in a controlled state, although the issues were ongoing.
One of the pressing issues had been the irregular sign-outs from the platform, which we initially attributed to the ongoing errors. With further analysis of the logs we saw a correlation between certain spikes in the app server load and corresponding 502 errors, which led to user sessions being transferred by the load balancer to a different app server. The load spikes happened at regular intervals and resulted in high memory usage and a stalling JVM. During that time the load balancer did not get a response to its health check, assumed the server wasn't responding anymore, and redirected the request to a different app server where the user session wasn't known, so a new sign-in was required.
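The mechanism is easiest to see from the load balancer's point of view. Here is a minimal sketch of such a health probe; the URL, port and timeout values are invented for illustration and are not our actual load balancer configuration.

```java
import java.net.HttpURLConnection;
import java.net.URL;

/** Sketch: a stalled JVM cannot answer the health probe in time, so the node gets marked down. */
public class HealthCheckSketch {

    public static void main(String[] args) {
        try {
            // Hypothetical health URL probed by the load balancer
            URL probe = new URL("http://appserver-03.internal:8080/health");
            HttpURLConnection conn = (HttpURLConnection) probe.openConnection();
            conn.setConnectTimeout(2000); // the LB gives up after a couple of seconds
            conn.setReadTimeout(2000);
            System.out.println("node healthy, HTTP " + conn.getResponseCode());
        } catch (Exception e) {
            // A full-GC pause longer than the timeout ends up here: the node is marked down
            // and sticky sessions are re-routed to another app server, which has never seen
            // the user's session - hence the forced sign-in.
            System.out.println("probe timed out -> node removed from cluster, sessions re-routed");
        }
    }
}
```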
The regularity of the spikes raised the suspicion that this could be triggered by some background task. The most likely candidate was a task responsible for collecting newly created content and feeding it into the internal search index. This index has a size of roughly 25 GB, and the unfortunate architecture of its implementation requires a copy on every single app server. Most operations on this index are fairly memory-intensive, and this particular activity led to a temporary out-of-memory situation and multiple successive full GC runs. We gave it a try, disabled this task on a single node and restarted the server.
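Conceptually the experiment looked like the sketch below: a scheduled job that can be switched off per node. The property name, interval and code are invented for illustration; the platform has its own task framework, and this only shows the idea of a per-node kill switch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch: a memory-hungry background job with a hypothetical per-node kill switch. */
public class IndexFeedTaskSketch {

    public static void main(String[] args) {
        // Hypothetical switch, e.g. -Dindex.feed.disabled=true on the test node
        boolean disabled = Boolean.getBoolean("index.feed.disabled");
        if (disabled) {
            System.out.println("index feed task disabled on this node");
            return;
        }
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                // memory-hungry work: collect newly created content and merge it
                // into the local copy of the ~25 GB search index
                System.out.println("feeding new content into the search index");
            }
        }, 0, 15, TimeUnit.MINUTES);
    }
}
```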
The logs for this server showed a substantially lower number of 502 errors even after 60 minutes, so we rolled the configuration out to the rest of the cluster, followed by a rolling restart.
It worked! [Twitter feedback]
It did not solve all our issues, but the application was finally closer to our expectations performance-wise, and with the sign-out issue (mostly) solved, the end-user experience improved a lot. Finally it was possible to get some work done on the system without the massive frustration of getting kicked out while posting a forum reply or drafting a blog post.
Thursday
There are still issues with the application which don't allow us to go into a normal operations mode. The servers are performing much better, but a couple of restarts were still required during the day. There is much more work ahead.
One thing we did was to separate the DB schema responsible for the activity stream and move it to a different server. We also reduced the number of cache servers, which reduced the network traffic and took some load off the app servers by not having to send the content to multiple cache servers.
The load on the DB is still way too high, and we are working closely with the vendor to improve the responsible queries, which are built dynamically by the application. This requires code changes tailored to our specific setup (which is not unusual, but more on that in a different blog post).
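For readers wondering what improving dynamically built queries can mean in practice: one textbook remedy for the unique-query problem described above is to keep the SQL text stable and pass the filter values as bind parameters, so the database can reuse one cached plan for all variants. The sketch below only illustrates that principle; table and column names are invented, and this is not the vendor's actual code change.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Sketch: stable SQL text plus bind parameters instead of concatenated literals. */
public class ParameterizedBrowseSketch {

    static void printBrowseResults(Connection conn, String contentType, String textFilter)
            throws SQLException {
        String sql = "SELECT id, title, modified FROM content "
                   + "WHERE content_type = ? AND lower(title) LIKE ? "
                   + "ORDER BY modified DESC";
        PreparedStatement stmt = conn.prepareStatement(sql);
        try {
            stmt.setString(1, contentType);            // different values...
            stmt.setString(2, "%" + textFilter + "%"); // ...but always the same statement text
            ResultSet rs = stmt.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getLong("id") + " " + rs.getString("title"));
            }
            rs.close();
        } finally {
            stmt.close();
        }
    }
}
```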
Work has started on fixes for pressing issues that we saw in your feedback like redirect loops at login, problems with updating the user profile page and usernames not getting changed during the initial login to the app. These fixes will get rolled out soon, including further performance improvements. We are listening to your feedback and we are aware of some missing blog posts or forum threads. All the reports get collected and we will provide the missing content soon.
It has been a rough start and I apologize for the subpar experience you had in the first three days. Projects of this size are always a challenge in some form or another, and this one is no exception. Thanks a lot for your amazing support and the tremendous feedback that you have provided during this week. Please continue to send us bug reports in our SCN Bug Reporting space, as this is invaluable feedback.
I’ll keep you updated.
Oliver
Hi Oliver,
thank you for that post, it's nice to get some insight.
Predicting user behaviour is always very difficult.
Cheers
Adi
Oliver,
Thanks for the explanation. It's valuable to see this launch from your IT architecture/delivery POV. Given what was required, the progress is amazing. I appreciate your dedication to the community.
Regards,
Gail
Great blog! I'm sure the Community will read all the details avidly. For me, part of the SCN Team, it was a good update as well.
Many thanks Oliver for all you've done for this project.
Laure
Fantastic blog; rich with insight. I don't know how you found the time to write it with everything else going on, but we sure do appreciate this perspective.
Oliver, grats for (finally) getting the system running and thanks for sharing this insight.
Too many caches to supply from a single point ... interesting problem.
And to think you've been quietly logging everything...
I'll save it for the kids.
I left out the ugly details to make it kid-friendly.
You can sell the ugly details when your book deal goes through 😉
Great job, guys!!!
Thank you for the detailed insight. Seems that you need real browser load testing before you can re-enable the MoreLikeThis widget. The open source tool Selenium can perhaps help you there. There are also cloud-based testing services. Check out http://seleniumhq.org/support/.
Hey Gregor,
we use Selenium but only for automating functional tests. For load testing we use JMeter and it is fairly easy to add this use case to the scenario via HTTP requests, which we will do once we decide to bring the widget back.
Hi Oliver,
Hats off! Great sneak peek into the DB side of things, but referring to a DB vendor suggests it's not SAP's DB, which leaves us to guess: IBM or ORCL. But I don't think this is a question that can be answered in public, right?
Hey Gregory,
I guess Oliver meant the software vendor and not the DB vendor. I cannot imagine that SAP is using Oracle for its own platform, that would be too funny 😈
Regards
Stefan
good answer even though it doesn't answer the question
I think I should follow-up in the same vein, with an honest analysis of the good & bad in the performance tests we did. For all of the "science" in it, it is basically an art IMHO, and given the limited time and knowledge (about the future especially...), is based on many assumptions. We did clear up a number of huge issues before launch, but obviously missed some others.
I think transparency here is paramount - sharing, listening, improving (I got my personal list, now show me yours)
Elad, please do so. Your input to the projects has been awesome. It was you who found the background task dependency, which made the biggest impact. Sharing your incredible insight with the community would be a great addition. Looking forward to it.
Hi Elad,
would be greatly interested in more details about the testing.
Best regards
Gregor
Elad: I, too, vote for you to share your experience in a blog. I wanted to share some of the performance test data and process in a blog, but some of the content is internal-only, and I'm not qualified as you are to decide what to share and to explain it properly.
Would be done, though I hesitate to say too much with authority before the major problems are 99% gone.
However, I am indeed planning to start contributing for the long term. I'm probably gonna start Sunday - it's always a quiet day without any calls with Europe... (touch wood).
This is what I like about the SCN team as a whole: they are fairly transparent, which is simply commendable. Considering the size of this project, it's huge huge huge.... Issues like these happen, but I can see the improvement happening on an hour-to-hour or day-to-day basis.
Thanks for the awesome blog. Please keep up the good work you all guys are doing. I really appreciate it.
We have come so far and will soon reach the destined place :)
Well done
Nabheet
I should start to tweet everything that I eat... 😉
Thanks Oliver. I'm fascinated by the architecture issues for systems as large as SCN, and this sort of insight in what went wrong and how it was diagnosed is really interesting. I hope I remember some of the lessons learned for my next major go live, albeit on a much, much smaller system!
Thanks for all the work guys. Having to deal with issues like this very much under public scrutiny must have been really difficult. You've done a great job. Now you need a serious go-live party... 😀
Thanks Steve for the kind words and the reminder on the go-live party, completely forgot about that 🙂
Hi Oliver
Thanks for posting this blog. It's nice to know which challenges were at hand and how they were handled. I hope this brings more understanding from the community for the situation in a broad way.
Kind regards
Tom
Hi Oliver,
Thanks for this blog, it's very interesting. We need SDN HANA to have everything in memory 🙂
Peter
Some people in tech. forums always say "Well, with LINUX that <insert bad thing here > wouldn't have happened..."
So I guess we should rephrase that now to HANA 😏
Btw, in actuality a lot of stuff is indeed already cached in memory. That's how we could serve millions of requests (from after the Monday crash till the Wednesday big fix) while the DB was threatening to hurt itself due to load. Now THAT is amazing to me...
>So I guess we should rephrase that now to HANA 😏
We need HANA on LINUX just for sure 🙂
Thanks for sharing Olivier.
This just goes to show the importance of having a plan and organisation that is ready to handle a major GoLive. There are always surprises regardless of how well you prepare, and the major question is how you handle those known unknowns (ugh.. quoting that guy). Based on the report you've given it seems you had a plan that included all the necessary stakeholders, and that's what makes me confident that you'll solve these initial issues.
Good luck
Hey Dagfinn, welcome to the new SCN! Glad to have you here and thanks for the support. Projects of scale like this one are multi team efforts and require close coordination and this team I'm working in is the most skilled, most experienced and most committed one I was ever allowed to work with. Man, I'm so lucky to be at this place here, it's all worth the effort!
Cheers,
Oliver
Are you planning to run the DB part on HANA any time soon? 😉
What type of app servers is this running on?
Yes, Denis, we do plan to insert HANA into the SCN infrastructure; timing not yet confirmed as we needed to get this initial launch in place first.
Thanks Oliver for sharing what's under the hood. Nice to see the transparency being conveyed to the community. Very enjoyable to read even though it is 2000 words. 😉
Hi,
really nice blog. It confirms how hard it is to do load testing properly. It seems like the only safe option is to replay/redirect live traffic into a test system and see how it goes. Companies like Netflix do this before they switch to a new version. Obviously, in this case that was not possible because the site is brand new.
Cheers
I wish we had had the option to do a phased roll-out, but technical constraints like keeping the thread / message IDs in place for one-to-one redirects made a big bang go-live necessary.
Thanks for this blog, it's very interesting
Hello Oliver,
thank you very much for sharing some insights. That is how it should be - name it as it is.
The only part I am wondering about is this - you wrote "Twice a week we were deploying to the platform and running a delta migration of the latest blog and forum posts published on the old platform".
A lot of blog content is just destroyed due to HTML code issues, missing content or images. How could this not be noticed during the twice-weekly delta migrations? I tried to restore one of my blogs (because an SDN member needed its content for his work), but it is nearly impossible.
What is the strategy about the blog content now? Any roadmap for that?
Anyway great job by the infrastructure team - the platform is usable since a few days and the experience is getting better every day (except the blog part!).
Regards
Stefan
Hi Stefan,
in hindsight we should have put more effort into the blog content migration, but these issues were not noticed during the migrations.
The mitigation plan is to make the missing content / pictures available to the original authors and support them in getting their blog posts manually restored. There is already a thread, Lost blog content, where this is being discussed. I'm not sure if there is a central document to gather these issues, but there should be one if it does not exist yet.
Best,
Oliver
Great blog! I wish I had seen this before writing my most recent one!
I've always felt it's better for a company to come out and in a manner of speaking, "air it all out". It goes a long way in making clients/customers confident in knowing that you understand the issues, have a plan, and acknowledge the problems.
Congrats on getting a handle around the issues and resolving them. I'm looking forward to next week and getting back to the true business of SCN once again.
By the end of next week I'll probably forget your team even exists! Which is actually a great measure of an infrastructure and support team!!
Keep up the good work.
FF
Thanks for your feedback FF, appreciate it. I read your blog post and understand your and other people's frustration. SCN being mostly a tech community, I thought the information given in my blog post would fall on fertile ground. Hopefully these issues will be forgotten very soon and people will start embracing the enormous potential of the new platform.
Keep your feedback coming, even if negative. It keeps us going to make the best community platform, one that this awesome community deserves.
Best,
Oliver
Thanks for the insightful blog post Oliver. I am pleased with how the site is looking and I know you guys are working hard on getting the little kinks out of the system.
Cheers,
Nigel
Oliver
Thanks for sharing. As a user, the performance of the site was terrible early in the week (always being logged out) but by Thursday it was good and usable. So well done for fixing the problems so quickly. And in the long term, the new platform should be much more usable and useful and user-friendly.
John
That's exactly what we think as well. Technical problems are way down, but let's fix and improve features until this place is great. We'd be glad to have your input (and bug-reports), and in the meantime please explore all the big & small things already in place.
Elad
Hi, Oliver.
Thanks for sharing your professional insight into what happens in the project.
It's a very important input.
Regards,
Ella.
Hi Oliver,
Appreciate the transparency, and letting us know the technical details.
Thanks to everyone in your team for resolving the major bugs.
Probably it's high time we provide additional insight from behind the scenes...I'll get to work on it from my end and provide some view into the zeitgeist 🙂
Thank you very much for this Oliver. It's quite interesting to read about the challenges the troubleshooting and the solutions.
I'm thinking of writing my own blog on what could have been done differently - before the launch. Not to be negative, but just a simple wish we had done such and such. A learning experience for all 🙂 .
In any case, I am very encouraged by the work that you and all of the Jive team has done to date. Together, we'll make this thing a site that will be envied by all
- Ludek
People who disparage the new SCN platform don't really know the effort being put in behind the scenes by the SCN team. This blog is a must-read for them. Special thanks to you.
Kesav
Hello Oliver,
Thanks for this very interesting blog !
Good to see that even SAP can have a hard time going live 😉 ! Just kidding ! No offense guys !
Some may notice the failing parts only. I see it as a very good work in terms of reactivity and troubleshooting.
Keep up this very good work. I like the new SCN.
Steve.
Hello Oliver,
Thank you very much for your blog and the new SCN. It's looking very good. Are there any advantages for the end user with the new SCN compared to the old one?
Best Regards,
Harish.Y
Great blog! I felt like I was almost in the trenches with you "fighting the good fight" ! haha Thanks most of all for the honesty and openness. Very nice!