Former Member

Load Testing as Science and Art

The aim of this post is not specifically to shed more light regarding what went wrong before the launch of SCN, but I do promise to get there. In fact, I hope to do much more: inspire some thinking about the making of load-tests for any big, complex system. I’ve been to a few of these, and many of you have been through this as well, I guess.

One thing that’s always noticeable to me, either when I present my findings or when I read of others’ experiences, is this aura of “OMG SCIENCE AT WORK! HYPOTHESES, ISOLATING VARIABLES, STATISTICS…FEAR ALL YE PRODUCT PEOPLE!”. I do admit it’s kinda satisfying as an engineer to bask in that light. However, when looking at the details, there appears an intricate layer of reasoning, switchbacks, convenient omissions and the like, which makes it read more like a novel (and a bad one at times). Why does that happen?

The Unknown

One major reason, I think, is the actual amount of unknowns you’re facing. Even when you have the legacy of an existing system such as the old SDN, there appear numerous questions. Here are but a few – some are easy, some are hard.

  • The old system had of course the concept of replies to discussions, and now we added Likes and Shares on top – so how do we estimate the number of such actions? How do these actions affect the number of replies? A safe bet here, which we actually took, is to leave the rate of replies as it is but add likes and shares by a factor of 4x. Why 4x? Because these are much easier tasks for the user to do than reply, and so we needed to have SOME factor there. Of course, one could argue, likes and shares would also increase the number of replies – because people would reply just for the sake of being liked. Here you could hopefully see the endless loop of discussion which may surround every little detail. So, you decide, this corner of the system really doesn’t matter all that much; you go for some nice factor, telling yourself that it’s all fine because you added a lot of load for SOME OTHER FEATURE.
  • And here’s another one: looking at the logs from the old system, we were quite surprised at the number of requests for RSS feeds. I’m not an avid fan of the technology myself, and I had to wonder how many of these registrations are actually “active” – how many users registered once to a blog/discussion and have since forgotten about it? In other words, how many would bother to register again in the new system, given the fact that the new system has more modern (and arguably better) functionality which serves a similar purpose: Followed Activity? For this corner of the system, we actually decreased the number of RSS requests compared to the old system, while making sure to set a really high rate for the All/Followed Activity page, which we already knew was quite heavy at times. In this process of bargaining, we always made sure the total number of page-views per hour would sum up to what we calculated as representative of a “busy hour in a busy day” – in the old system. Nice, but is this old hourly rate enough anyway? Maybe not, but then you have to make your baseline SOMEWHERE and start loading from there.
  • The long-tail of content: Some content, such as Oliver Kohl’s blog, is more popular than others 😉 and so you look at your first draft of the test and think: maybe I’m actually making it too hard on that poor system…I mean, I’m randomly requesting threads that no one would look for anymore, when actually it might be that 20% of threads get 80% of the views, and maybe it’s even closer to 5% getting 95%! If that is indeed the case, my system is gonna smoke that test with its in-memory caches! Results would be wonderful…but if you’re like me, you just don’t take that path of glory. You know you’re putting some unrealistic load here, erring on the harsh side, and you tell yourself it evens out with some other place where you unknowingly made life way too easy on the system.
  • Product people don’t know any better than you: there are many respects in which they actually do know better (as painful as this is to admit), but surprisingly not in this case. I’ve had this time and time again: You can’t go to someone who’s an expert on functionality and demand some numbers. It’s hard to even get some realistic usage scenario from them. They just don’t think that way, they don’t have that info, and these nice little user-stories are totally made-up anyway – and they know it even better than you do. In the rare case they get TOO INTERESTED, however, you face the possibility of someone actually questioning your basic axioms all over again. Of course, you’re open to that, but as George Smiley once said in “Tinker Tailor Soldier Spy”:
    “TO A POINT”.
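
The bargaining described in the first two bullets can be written down as a small calculation. What follows is only a sketch in Python: every number and action name is a made-up placeholder rather than an SCN figure, and reading "4x" as "four times the reply rate" is an interpretation, not something the text states outright.

```python
# Requests per hour in a "busy hour on a busy day" of the old system.
# Every number here is a made-up placeholder, not a real SCN figure.
LEGACY_HOURLY = {
    "view_thread": 60_000,
    "activity_page": 8_000,
    "rss_feed": 30_000,
    "reply": 2_000,
}
PAGE_VIEWS = ("view_thread", "activity_page", "rss_feed")


def page_view_total(mix):
    """Sum only the actions that count as page views."""
    return sum(mix[name] for name in PAGE_VIEWS)


def build_mix(legacy, social_factor=4.0, rss_keep=0.5):
    """Derive an hourly target mix for the new system from legacy rates."""
    mix = dict(legacy)
    # Replies keep their old rate; likes/shares are assumed to be
    # social_factor times the reply rate (an interpretation of "4x").
    mix["like_or_share"] = round(legacy["reply"] * social_factor)
    # Assume only a fraction of the old RSS traffic comes back...
    mix["rss_feed"] = round(legacy["rss_feed"] * rss_keep)
    # ...and move the freed-up page views to the heavy activity page,
    # so the hourly page-view total stays at the legacy baseline.
    mix["activity_page"] += legacy["rss_feed"] - mix["rss_feed"]
    return mix


target = build_mix(LEGACY_HOURLY)
for action, per_hour in sorted(target.items()):
    print(f"{action:15s} {per_hour:>7d}/hour")
```

Keeping the guesses (`social_factor`, `rss_keep`) as named parameters at least makes each arbitrary choice visible and easy to rerun with a different value.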

I could go on in a similar vein forever here, but a pattern does emerge I guess: There’s just a lot you don’t know, and nobody’s gonna help you. So, you try to strike a balance which FEELS right – to you.
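
The long-tail question from the list above is one of the few unknowns that can at least be explored numerically. The sketch below replays two synthetic request streams through a toy LRU cache: one picks threads uniformly at random, the other sends 80% of the traffic to 20% of the threads. The thread counts, the cache size and the choice of LRU are all assumptions for illustration; nothing here reflects how SCN's caches actually work.

```python
import random
from collections import OrderedDict


def hit_rate(requests, cache_size):
    """Replay thread IDs through a simple LRU cache; return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict least recently used
    return hits / len(requests)


def uniform_requests(rng, n_threads, n_requests):
    """Every thread is equally likely: the 'hard on the system' test."""
    return [rng.randrange(n_threads) for _ in range(n_requests)]


def skewed_requests(rng, n_threads, n_requests, hot_share=0.2, hot_traffic=0.8):
    """hot_traffic of all requests go to the first hot_share of threads."""
    hot = min(max(1, int(n_threads * hot_share)), n_threads - 1)
    out = []
    for _ in range(n_requests):
        if rng.random() < hot_traffic:
            out.append(rng.randrange(hot))
        else:
            out.append(rng.randrange(hot, n_threads))
    return out


rng = random.Random(42)  # fixed seed so reruns are comparable
uniform = hit_rate(uniform_requests(rng, 10_000, 50_000), cache_size=2_000)
skewed = hit_rate(skewed_requests(rng, 10_000, 50_000), cache_size=2_000)
print(f"uniform: {uniform:.2f}  skewed 80/20: {skewed:.2f}")
```

The gap between the two hit ratios is a rough measure of how much harder the uniform test is on the caches than a popularity-skewed one would be.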

You Don’t Have Enough Time – And This Will Never Change

An excuse? For sure, but also kind of a given for load-testing, because of the golden rule of load-tests: “By the time the system is mature & stable enough to test, it’s time to deliver already”. You can pathetically try to stress a work-in-progress, but you would be hammered on all sides by the bugs, and could not compare the previous week’s results to this week’s anyway. In all probability, you are also busy building that system (someone has to do it), while hoping that your solution does scale as planned, but you don’t really know except for some synthetic micro-benchmarks, which a lot of people won’t even bother to do.

This also relates to a rant about optimization: too many people seem to think that optimization is the choice between ArrayList and LinkedList when their list is 5 items long. They don’t realize that performance is usually born out of architecture. If it’s well-built, then it can scale already or can be fixed to become so. As for our context, this probably means that during the happy development phase, many people won’t know how to code or what to test anyway when it gets to this dirty issue of performance.

My & Your Tests are Static, Reality is Not

Given the first rule which concerns the inherent lack of time, here is one thing that we tend to miss, and I think we missed it here.

You think you came up with some use-cases which describe “a day in the life” of your system. In reality, nonetheless, there is always change. Sounds like New-Age talk? Well, you better believe it. We based our scenarios on users that pretty much know how to get to stuff – as many did in the latter SDN days. On our launch, however, our users did not know how – for various justified reasons which have nothing to do with performance per se. And so they went to the content browser and clicked on just about every possible combination, with or without search terms, trying to FIND THAT CONTENT ALREADY. And so, a usability issue (which would both need better tooling on our side and might become less of an issue as many users become more accustomed to current navigation methods) became a performance issue – because it just happened to be generating lots and lots of unique queries that are really hard to cache on any level. Granted, this was not the only feature whose usage patterns we did not account for, but it’s enough to have just one which hurts – and then it doesn’t really matter that you found ten other such biggies just before launch (why just before launch?? see again: “You don’t have time” etc.)
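
To see why "every possible combination" is so hard on caches, it helps to count. The sketch below uses invented filter dimensions (the real content browser's filters are not listed in this post) and compares two kinds of users: regulars who keep returning to a handful of views, and explorers who click random combinations. It measures the share of requests whose exact query has never been seen before, which no cache of any size can serve.

```python
import itertools
import random

# Hypothetical filter dimensions, purely for illustration.
FILTERS = {
    "type": ["all", "blog", "discussion", "document", "poll"],
    "space": [f"space-{i}" for i in range(200)],
    "sort": ["latest", "most_viewed", "most_liked", "oldest"],
    "period": ["any", "day", "week", "month", "year"],
}


def all_combinations(filters):
    """Every distinct query a user can build without typing a search term."""
    return list(itertools.product(*filters.values()))


def cold_ratio(keys):
    """Share of requests whose key was never requested before.

    These are guaranteed misses even for a cache of unlimited size.
    """
    seen = set()
    cold = 0
    for key in keys:
        if key not in seen:
            cold += 1
            seen.add(key)
    return cold / len(keys)


rng = random.Random(7)
combos = all_combinations(FILTERS)

# Regulars: 5000 requests spread over the same 50 familiar views.
regulars = [combos[rng.randrange(50)] for _ in range(5_000)]
# Explorers: 5000 requests, each a random combination of filters.
explorers = [rng.choice(combos) for _ in range(5_000)]

print(f"distinct queries possible: {len(combos)}")
print(f"cold share, regulars:  {cold_ratio(regulars):.3f}")
print(f"cold share, explorers: {cold_ratio(explorers):.3f}")
```

Free-text search terms are left out on purpose: they make the key space effectively unbounded, so the picture only gets worse.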

You (and I) are Doing it Wrong

The final point, for today at least, is that even if you came up with a brilliant test set, you probably use tools in a way that doesn’t REALLY match the real world. One case in point: AJAX requests, such as the “More Like This” query on content pages which brought the system to its knees on its first day live. Unless you have a grid of computers at your disposal just aching to act as your test clients, it is much more feasible to just fetch the HTML content of pages instead of running in a real browser – which has this nice feature where it loads & runs not just static resources but all JavaScript content. Instead, you look at the page and see what the “important” AJAX calls are, so you can just mimic these directly. When you miss one “important” call, as was the case here, it can blow up in your face.
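
One cheap guard against missing an "important" call is to record a real browser session and diff it against what the load script requests. The sketch below assumes the session was exported as a HAR file (a JSON format most browser developer tools and HTTP sniffers can produce) and that the load script's requests can be listed as (method, path) pairs. The URLs, paths and the static-suffix list are hypothetical.

```python
import json
from urllib.parse import urlsplit

# File types treated as static resources and ignored (an assumption).
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico", ".woff")


def har_paths(har):
    """Return the set of (method, path) pairs in a parsed HAR capture."""
    pairs = set()
    for entry in har["log"]["entries"]:
        request = entry["request"]
        path = urlsplit(request["url"]).path
        if not path.lower().endswith(STATIC_SUFFIXES):
            pairs.add((request["method"], path))
    return pairs


def missing_from_script(har, scripted):
    """Requests the browser made that the load script never issues."""
    return sorted(har_paths(har) - set(scripted))


# A tiny hand-written capture; in practice use json.load() on a .har file.
SAMPLE_HAR = json.loads("""
{"log": {"entries": [
  {"request": {"method": "GET", "url": "https://example.com/thread/123"}},
  {"request": {"method": "GET", "url": "https://example.com/static/site.css"}},
  {"request": {"method": "GET", "url": "https://example.com/widgets/more-like-this?id=123"}}
]}}
""")

# What the (hypothetical) load script requests for the same page.
SCRIPTED = [("GET", "/thread/123")]

missing = missing_from_script(SAMPLE_HAR, SCRIPTED)
for method, path in missing:
    print(f"not covered by load script: {method} {path}")
```

This only compares paths, so it will not catch a call that the script does make but with unrealistic parameters; it is a checklist aid, not a proof of coverage.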

By the way, I don’t think there’s a magic bullet here: I’ve twice heard in the context of Cucumber/Capybara-based automated testing that one should use HtmlUnit instead of the default Firefox driver. You get a “real” browser core with JS, but don’t have to suffer the burden of a full browser – and all is super-fast and well. In these two instances, when the browser implementation was changed from HtmlUnit to Firefox just as a demonstration for me, the tests then failed immediately, at which point the response of all involved was: “So….anyway….”.

Before you get all depressed, I have to say that the picture is not that bad: if you work your way against all odds, you manage to somehow get to production with most of the big pain points already squashed. In a major site like SCN, where people actually care (and I appreciate that every day!), you get the rejects, live with them and fix the roadblocks as fast as you can, so we could argue about features again (which is the best possible state, I guess). Everybody knows that for internal deployments, as messy as they are, users would just have to live with it until the situation is fixed…(and so it’s sometimes never fixed). For us here, this is not an option of course.

Now, let’s see if the patches and fixes give all of you the experience expected. The error & response-time numbers show me an increasingly better picture, but as we now know… well, these are just numbers.

      Oliver Kohl

      This is from the source, excellent insight. Thanks for contributing here in this space and sharing your part of the SCN migration with the community. So good to have you on our team.

      Detlev Beutner

      Hi Elad,

      Annoying as I am, I would like to add some remarks 😉

      1.) "You Don't Have Enough Time - And This Will Never Change" - Such sentences in my eyes / ears are one of the reasons why failures are repeated throughout the whole world again and again; people don't learn from failures, not from their own, nor from others. That frustrates me. But they could. A start would be: "We didn't have enough time - we knew before, and now we have shouted so loudly to the people responsible that their hair will fly backwards the next hundred years; let's keep the fingers crossed that this helps those people to remember to plan better next time." 😉

      2.) Of course, the sentence would be right if you discuss 100% perfection. But that's not the point. The start of new SCN was so far from 100%, it was a big pain, for all people involved.

      3.) "for internal deployments, as messy as they are, users would just have to live with it until the situation is fixed...(and so it's sometimes never fixed). For us here, this is not an option of course." I must have missed something. Really: If in the company where I am at the moment our internal (migration) release at the end of this year would look like the SCN migration, heads would roll. And I have to add: I would have some understanding for this.

      4.) For the ajax requests missed: I think this needs a bit more concrete self criticism, because it's really easy to see with HttpWatch that this widget is one of the problematic ones; also the vendor of course should have helped more and should have some more knowledge to share. Anyhow, as outlined on my blog, I absolutely see that such a thing can happen; no offense meant (so far).

      So, altogether: Yeah, "**** happens", and I'm the first to sign this. I always very much differentiate if "****" happened by free will, deliberately, or by accident. In the latter case, it's not worth trying to find "guilty ones" - as long as the people "responsible" kick themselves (and try to improve the environment so that these things won't happen again).

      I think the migration of SCN is some kind of paragon for things which can be planned wrong, go wrong etc in a more or less serious huge project (at least concerning the amount of data and users). "We will always have too little time" is the wrong approach to get something out of the failures which happened 😉

      So, Elad, even with all the criticism you might (and should ;-)) read out of my lines, in the end I would love to see that people like you get more time to be able to deliver quality. I'm sure you would be happier with that too, and so it's also your happiness which concerns me 🙂

      Best regards

      Former Member

      You think YOU are annoying? Join the club I say... 😉

      Not being afraid to stand for what you believe in is one thing, which I'm usually doing to the point of annoyance, so for me at least that's actually the easy part. However, I think my underlying message here is that it's not (always) just the old narrative of us techies fighting management to get more time, resources and understanding. Rather, sometimes our job is inherently not as "academic" as we would like it to be. Rather than an excuse, this should be a catalyst for more discussion inside the technical community, I think: how do we best factor in all the unknowns? We could have set a really high bar for each and every action in the system, and have it surely crash - but then we would learn nothing. Sometimes it's simple work done wrong (as is your point about AJAX), but sometimes the point is more subtle and regards usage of new features in a way and volume you did not expect from a FUNCTIONAL perspective. In a system where highly skilled individuals are performing a well-defined transactional process, it's actually much easier to predict behavior than in a system like SCN - but this only means we should try harder. We can always hide here behind the fact that "it's the functional guys' job to give us this info" - but they cannot, and so they would not, so we try to account for that.

      You really pointed here to one of the more important and touchy subjects: When is a project deemed "ready"? If you just KNOW it's not ready, that's easy to fight for. However, as an architect you are part of the decision making and must overcome some fears in that department. If you said "we're good enough for this imperfect world!" and you were right, cool. If you were wrong but fixed the mess before anyone really noticed, great (and who with enough time in the industry does not have such a confession?). If you erred, well, you erred and have to revisit your steps honestly. I think we are now getting the perspective needed - we could not do this while still "in the trenches". So, I accept your criticism - explaining why it can so easily go wrong does not equal justification. However, I try to go beyond the dichotomy of techies vs. "all the others".

      Regarding internal deployments, I'm always happy to hear about true commitment to quality even in places where change could just be shoved down people's throats and then fixed "sometime". I'm only commenting from my own experience, which is honestly btw pre-SAP...

      Again, I think some major changes are easier to accomplish when you have a limited number of actors which own data, when the users are all known and highly-skilled, when they MUST CONFORM to procedures that you set etc. That's why I think external systems are harder...and that's why I'm doing one right now 😉

      One point in which I think we might differ is the extent of the "disaster" and the site where it hurts the most. Not to underestimate the performance side of the story, the biggest problems in that department were solved within less than 3 days. The next leap in stability (again, based on statistics a.k.a. "lies") took a bit more. I think this is bad, but more worrying to me are the functional issues. Again, this is a question of perspective, as I am by nature looking at bug reports much more than at non-buggy behavior of users, and even I see a lot of activity and potential despite problems. My biggest problem is probably that I'm an optimist underneath it all...Of course, I'm also not objective here, and to some extent neither is any involved member of our community. If someone shouts "This and this sucks!", well, at least they care enough and really deserve to be attended to.