In the previous post of this ‘trashy’ blog, Implications of Java Garbage Collection for Multi-User Applications, I introduced the impact of Java garbage collection on multi-user applications. In particular, I talked about the ‘stop time’ and the ‘bunching’ effects. The former simply captures the ‘static’ effect of threads halting during garbage collection; the latter describes the ‘dynamic’ effect of GC disarranging the request queue.
Last time I just gave you a taste of what these effects can mean for the response time performance in one specific example. Now I think it is time to open the black box a little and give details on the model I use to describe the GC effect. The aim is to arrive at an ‘educated rule of thumb’ which allows you to estimate the garbage collection impact for a wide range of parameters for application and garbage collection. To get there, we’ll need three steps:
1. We start with a little algebraic exercise: we calculate the pure stop time effect. Here the parameters of the game get introduced.
2. Then we proceed to include the bunching effect. I don’t know how to do this analytically, so I resort to a Monte Carlo simulation of incoming requests being serviced by a certain number of CPUs. From time to time, a garbage collection comes along and mixes up the situation. From the simulation, it is straightforward to extract the GC effect on the average response times. But unfortunately, to get the answer for other conditions (like a different duration of a full garbage collection or a different raw service time of the application) we have to rerun the simulation. Not very convenient.
3. Therefore, we do step 3: we interpret the simulation result in terms of an effective parameter, the effective full GC duration. With this, we go back into the formula derived in 1. and finally arrive at the calculable ‘educated rule of thumb’ advertised above.
Altogether, this is a bit much to swallow in one piece, and I already promised last time to be a bit briefer this time ;-). So today, I’ll just do step 1:
1. Calculating the pure stop time effect
This is going to be a bit technical, and it’s important to stay on top of the symbols which pop up. Therefore let me start with the collection of parameters which will be used:
n: number of users
r: raw request service time
d: user think time
gc: single GC duration
fgc: GC frequency
We assume that we have n users. Distributed evenly over time, they trigger service requests with raw service time r, and then wait for a think time d. This is the situation sketched in the left part of the picture below. Then a garbage collection comes along and stops the world. Below I show the situation for the case that the GC duration gc is smaller than the think time d. One of three things will happen for the users:
(i) GC comes while a request is being serviced (user 1). This leads to an increase Δr1 of the total service time equal to the GC duration gc. As the user requests are assumed to be distributed evenly over time, we can also give the number of users who face this situation: n1 = n * r/(d+r).
(ii) GC comes along during the think time, and is finished before the think time is over (user 2). The number of these lucky users is n2 = n * (d-gc)/(d+r).
(iii) GC comes along during think time, but the request becomes active before the GC is over. This hits n3 = n * gc/(d+r) users. They experience all possible delays between 0 and gc, on average Δr3 = gc/2.
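The bookkeeping of the three cases can be put into a few lines of Java. This is my own sketch (class and method names are mine, not from the post): it returns the average extra response time that one garbage collection adds per user, covering both the gc < d case drawn above and the gc > d case mentioned further below.

```java
// Average extra response time per user caused by ONE garbage collection.
// Symbols follow the text: raw service time r, think time d, GC duration gc.
public class StopTimeEffect {

    static double deltaPerGc(double r, double d, double gc) {
        if (gc <= d) {
            double f1 = r / (d + r);         // case (i): hit during service, delay gc
            double f3 = gc / (d + r);        // case (iii): request starts during GC, avg delay gc/2
            return f1 * gc + f3 * gc / 2.0;  // case (ii) users see no delay at all
        } else {
            // gc > d: the lucky case (ii) vanishes; a thinker with remaining think
            // time t (uniform in 0..d) waits gc - t, i.e. gc - d/2 on average.
            double f1 = r / (d + r);
            double f3 = d / (d + r);
            return f1 * gc + f3 * (gc - d / 2.0);
        }
    }

    public static void main(String[] args) {
        double r = 0.25, d = 9.75, gc = 2.0;  // the r and d used in the plots below
        System.out.printf("delta per GC = %.4f s%n", deltaPerGc(r, d, gc));
    }
}
```

Note that both branches give the same value at gc = d, so the two regimes join smoothly.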
Now we put together the total response time increase per user for a single garbage collection (averaged over all n users):

Δr_GC = (n1 * gc + n3 * gc/2) / n = gc * (r + gc/2) / (d + r)
An equivalent calculation can be done for the case gc > d.
With the garbage collection frequency fgc (each user sees fgc*(d+r) garbage collections per request cycle), we get for the average response time Rav = r + Δr:

Rav = r * (1 + fgc*gc + fgc*gc²/(2r))   for gc < d
Rav = r * (1 + fgc*gc + fgc*d*(gc - d/2)/r)   for gc > d
This result describes the average response time increase under the influence of GC stop times. NOT included here is the bunching effect, which we’ll treat next time.
Let’s have a closer look at the content of the brackets. The “1” we understand. The second term fgc*gc is just the fraction of time spent in garbage collection. This can be called the “naïve” guess for the effect: “if we spend 2% of the run time in GC, the response time increases by 2%”. But of course that is not correct, due to the third piece. For the case gc < d, this is a quadratic term in gc, and we can easily see that it will dominate over the fgc*gc term as soon as the GC duration becomes larger than double the raw service time. For gc > d we switch to the lower equation, which gives a linear continuation of the response time increase towards large GC durations.
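To make the branch discussion concrete, here is a small numeric check. It is my own sketch, with the closed form reconstructed from the case analysis above, using the same parameters as in the plots below (r = 0.25 s, d = 9.75 s, fgc = 0.004/s):

```java
// Rav in the two regimes:
//   gc <= d: Rav = r * (1 + fgc*gc + fgc*gc^2/(2r))     -- quadratic in gc
//   gc >  d: Rav = r * (1 + fgc*gc) + fgc*d*(gc - d/2)  -- linear in gc
// The quadratic piece catches up with the fgc*gc piece exactly at gc = 2r.
public class RavCheck {
    static double rav(double r, double d, double fgc, double gc) {
        if (gc <= d) {
            return r * (1 + fgc * gc + fgc * gc * gc / (2 * r));
        }
        return r * (1 + fgc * gc) + fgc * d * (gc - d / 2.0);
    }

    static double naive(double r, double fgc, double gc) {
        return r * (1 + fgc * gc);  // the "naive" guess
    }

    public static void main(String[] args) {
        double r = 0.25, d = 9.75, fgc = 0.004;
        // at gc = 2r the quadratic term equals fgc*gc; beyond that it dominates
        System.out.printf("gc=0.5:  Rav=%.6f  naive=%.6f%n",
                rav(r, d, fgc, 0.5), naive(r, fgc, 0.5));
        // both branches agree at the switch point gc = d
        System.out.printf("gc=d:    Rav=%.6f%n", rav(r, d, fgc, d));
    }
}
```

Running this shows how quickly the naïve guess falls behind once gc grows past a few multiples of r.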
Here are two plots for illustration. I assume a raw response time r of 0.25 s and a think time d of 9.75 s. For the first plot, the GC frequency fgc is also kept fixed (at 0.004/s), so only gc is varied. Then we see nicely the quadratic behavior for gc < d and the linear continuation for gc > d. The dotted line shows the result for the “naïve” guess Rav = r*(1+fgc*gc). This guess is close to reality only if the GC duration is very small (note that therefore it normally is a good estimate for minor GCs!).
The plot above shows the direct dependence of the average response time Rav on the GC duration gc. All other parameters are kept constant, so we see the nice quadratic behavior. But keeping the other parameters constant means that the total time spent in garbage collection also increases with the GC duration! You might rightly say: “no wonder that the response time increases when we spend more and more time in garbage collection!” We should therefore also look at the result under the condition that the total GC time is kept constant (i.e. fgc*gc is constant). This is shown in the plot below (again compared to the “naïve” guess as a dotted line). Under this condition, we see a linear increase of the average response time for gc < d, and a levelling off towards large GC durations.
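The constant-total-GC-time condition is easy to play with in code as well. In this sketch of mine, c = fgc*gc is the fixed fraction of run time spent in GC (here 2%, the number used in the “naïve” guess example earlier), and fgc = c/gc is substituted into the stop-time result:

```java
// Rav with the total GC time fraction c = fgc*gc held constant.
// Substituting fgc = c/gc gives:
//   gc <= d: Rav = r*(1 + c) + c*gc/2              -- linear in gc
//   gc >  d: Rav = r*(1 + c) + c*d - c*d*d/(2*gc)  -- levels off at r*(1+c) + c*d
public class ConstantGcBudget {
    static double rav(double r, double d, double c, double gc) {
        if (gc <= d) {
            return r * (1 + c) + c * gc / 2.0;
        }
        return r * (1 + c) + c * d - c * d * d / (2.0 * gc);
    }

    public static void main(String[] args) {
        double r = 0.25, d = 9.75, c = 0.02;  // 2% of run time spent in GC
        for (double gc : new double[]{0.5, 2, 9.75, 50, 500}) {
            System.out.printf("gc=%6.2f s  Rav=%.4f s%n", gc, rav(r, d, c, gc));
        }
        // for gc -> infinity, Rav approaches r*(1+c) + c*d = 0.45 s here
    }
}
```

So even with the GC budget fixed, fewer-but-longer collections cost response time, up to a saturation value of r*(1+c) + c*d.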
Ok, so much for step 1: the pure GC stop time effect. After digesting this we will be ready to include the GC request bunching effect. That additional effect does not drop from heaven but naturally appears if we step back, look at the big picture – and find out that we omitted an important fact: there is only a finite number of CPUs to satisfy the requests. We’ll look at the consequences next time!