
How Many Millions of Dollars Do Enterprises Waste on Garbage Collection?

We believe enterprises are wasting millions of dollars on garbage collection, and that most of them are wasting this money without even knowing it. The intent of this post is to bring visibility to how these millions of dollars are wasted.

What is Garbage?

All applications have a finite amount of memory. When a new request comes in, the application creates objects to service it. Once the request is processed, the objects created to service it are no longer needed; in other words, those objects become garbage. They have to be evicted from memory to make room for servicing new incoming requests.
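To make this concrete, here is a minimal Java sketch. The class and method names are hypothetical, invented purely for illustration: the point is that every object allocated while servicing a request becomes garbage the moment the request completes and nothing references it anymore.

```java
// Hypothetical request handler; names are illustrative only.
public class OrderService {

    // Each call allocates short-lived objects to service one request.
    public String handleRequest(String orderId) {
        StringBuilder response = new StringBuilder(); // temporary object
        byte[] workBuffer = new byte[1024];           // temporary object
        response.append("processed:").append(orderId)
                .append(":").append(workBuffer.length);
        return response.toString();
        // Once this method returns, 'response' and 'workBuffer' are
        // unreachable -- they are now garbage, eligible for collection.
    }

    public static void main(String[] args) {
        System.out.println(new OrderService().handleRequest("42"));
    }
}
```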

Garbage collection evolution: Manual → Automatic

Three to four decades back, C and C++ were the programming languages most popularly used by the development community. In those languages, garbage collection had to be done by the developers: application developers had to write code to dispose of unreferenced objects from memory. If developers forgot (or missed) writing that logic in their program, the application would suffer from a memory leak, and memory leaks cause applications to crash. Thus, memory leaks were quite pervasive back in those days.

In the mid-1990s, when the Java programming language was introduced, it provided automatic garbage collection: developers no longer had to write logic to dispose of unreferenced objects. The Java Virtual Machine itself automatically removes unreferenced objects from memory. It was definitely a great productivity improvement, and developers enjoyed the feature. On top of that, the number of memory-leak-related crashes also came down. Sounds great so far, right? But there was one catch to this automatic garbage collection.

To do this automatic garbage collection, the JVM has to pause the application to identify unreferenced objects and dispose of them. This pause can take anywhere from a few milliseconds to a few minutes, depending on the application, workload, and JVM settings. While an application is paused for garbage collection, no customer transactions are processed; any transactions in the middle of processing are halted, resulting in poor response times for customers. So this was the trade-off: in exchange for developer productivity and fewer memory-leak-related crashes, automatic garbage collection introduced application pause times. Effective tuning can bring the pause time down, but it cannot be eliminated.

This might sound like a minor hit to customer response times. But it does not stop there: today, enterprises are losing millions of dollars because of this automatic garbage collection. Below are the details.

Garbage collection Throughput

'GC Throughput' is one of the key metrics studied when it comes to garbage collection tuning. This metric is cleverly reported as a percentage. What is 'GC Throughput %'? It is the amount of time the application spends processing customer transactions versus the amount of time it spends processing garbage collection activities. Suppose an application has a GC throughput of 98%: it means the application is spending 98% of its time processing customer transactions and the remaining 2% of its time processing garbage collection activities.

Does 98% GC throughput sound good to you? Since human minds are trained to read 98% as an A-grade score, 98% GC throughput certainly sounds good. But in reality, it is not. Let us look at the calculations below.

In 1 day, there are 1440 minutes (i.e. 24 hours x 60 minutes).

98% GC throughput means the application is spending 28.8 minutes/day in garbage collection (i.e., the application spends 2% of its time processing GC activities, and 2% of 1440 minutes is 28.8 minutes).

What is this telling us? Even if your GC throughput is 98%, your application is spending 28.8 minutes/day (almost 30 minutes) in garbage collection. For those 28.8 minutes your application is paused. It is doing nothing for your customers.
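The arithmetic above can be expressed in a few lines of Java. This is a simple sketch; the class and method names are mine, not any standard API:

```java
public class GcThroughputMath {

    // Minutes per day an application spends paused in GC,
    // given its GC throughput as a percentage.
    static double gcMinutesPerDay(double gcThroughputPercent) {
        double minutesPerDay = 24 * 60; // 1440 minutes in a day
        double gcFraction = (100.0 - gcThroughputPercent) / 100.0;
        return minutesPerDay * gcFraction;
    }

    public static void main(String[] args) {
        // 98% GC throughput -> 2% of 1440 minutes = 28.8 minutes/day
        System.out.printf("%.1f minutes/day%n", gcMinutesPerDay(98.0));
    }
}
```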

One way to visualize this problem: say you have bought a brand-new, expensive car and you want to drive it for a couple of hours. How would you feel if the car ran for only 1 hour and 50 minutes, stopped intermittently in the middle of the road for a total of 10 minutes, and still consumed gasoline the whole time? This is exactly what happens with automatic garbage collection: the JVM keeps pausing intermittently while the application is supposed to be processing customer transactions.

Dollars wasted

Even a healthy application's GC throughput ranges from 99% to 95%; sometimes it goes even lower. The table below summarizes how many dollars mid-size (1K instances/year), large (10K instances/year), and very large (100K instances/year) enterprises would be wasting based on their application's GC throughput percentage.

| GC Throughput %                                     | 99%      | 98%       | 97%       | 96%       | 95%      |
|-----------------------------------------------------|----------|-----------|-----------|-----------|----------|
| Minutes wasted by 1 instance per day                | 14.4 min | 28.8 min  | 43.2 min  | 57.6 min  | 72 min   |
| Hours wasted by 1 instance per year                 | 87.6 hrs | 175.2 hrs | 262.8 hrs | 350.4 hrs | 438 hrs  |
| Dollars wasted by mid-size company (1K instances)   | $50.07K  | $100.14K  | $150.21K  | $200.28K  | $250.36K |
| Dollars wasted by large company (10K instances)     | $500.72K | $1.00M    | $1.50M    | $2.00M    | $2.50M   |
| Dollars wasted by X-large company (100K instances)  | $5.00M   | $10.01M   | $15.02M   | $20.02M   | $25.03M  |

Here are the assumptions I have used for the calculation:

  1. A mid-size enterprise runs its applications on 1,000 EC2 instances, a large enterprise on 10,000 EC2 instances, and a very large enterprise on 100,000 EC2 instances.
  2. For the calculation, I assume these enterprises are running on-demand t2.2xlarge (32 GB) RHEL instances in the US West (N. California) EC2 region. The cost of this instance type is $0.5716/hour.
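Under those assumptions, the table's figures can be reproduced with a short sketch. The $0.5716/hour rate and the 365-day year come from the article's assumptions; the class and method names are mine:

```java
public class GcDollarWaste {

    static final double HOURLY_RATE = 0.5716;   // assumed EC2 on-demand $/hour
    static final double MINUTES_PER_DAY = 24 * 60;

    // Dollars a fleet wastes per year in GC pauses,
    // given GC throughput % and the number of instances.
    static double dollarsWastedPerYear(double gcThroughputPercent, int instances) {
        double gcMinutesPerDay = MINUTES_PER_DAY * (100.0 - gcThroughputPercent) / 100.0;
        double gcHoursPerYear = gcMinutesPerDay * 365 / 60.0;
        return gcHoursPerYear * HOURLY_RATE * instances;
    }

    public static void main(String[] args) {
        // 98% throughput, 1,000 instances -> ~$100.14K/year, matching the table
        System.out.printf("$%.2f%n", dollarsWastedPerYear(98.0, 1000));
    }
}
```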

The graphs below show the amount of money mid-size, large, and very large enterprises would be wasting due to garbage collection:

Fig: Money wasted by midsize enterprise due to Garbage Collection
Fig: Money wasted by large size enterprise due to Garbage Collection
Fig: Money wasted by very large size enterprise due to Garbage Collection
Note 1: I have made these calculations assuming GC throughput ranges only from 99% to 95%; several applications have much poorer throughput. In such circumstances, the amount of money wasted will be a lot more.

Note 2: I have used a t2.2xlarge 32 GB RHEL instance for the calculation. Several enterprises use machines with much larger capacity. In such circumstances, the amount of money wasted will be a lot more.

Counter arguments

Following are the counter-arguments that can be made against this study:

  1. For my study I have used AWS EC2 on-demand instances; I could instead have used dedicated instances for the calculations. But the price difference between on-demand and dedicated instances is only about 30%, so the figures can fluctuate by at most 30%. Even 70% of the above cost is outrageous.
  2. Another argument is that the AWS cloud is costly: I could have used another cloud provider, bare-metal machines, or a serverless architecture. These are all valid counter-arguments, but they would shift the calculation only by a few percentage points. The case that garbage collection wastes resources cannot be disputed.

You are welcome to articulate any other counter-arguments in the comments section. I will try to respond to them.

Conclusion

In this post I have presented the case for how an exorbitant amount of money is wasted due to garbage collection. The unfortunate thing is that the money is wasted without our awareness. As application developers/managers/executives, we can do the following:

  1. We should tune garbage collection performance, so that our applications spend much less time in garbage collection.
  2. Modern applications tend to create tons of objects even to service simple requests. Here is our case study showing the amount of memory wasted by the widely celebrated Spring Boot framework. We can write efficient code so that our applications create fewer objects to service incoming requests. If our applications create fewer objects, less garbage needs to be evicted from memory; if there is less garbage, pause times will also come down.
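As a starting point for item 1, here are HotSpot flags commonly tried when tuning GC pauses. This is a sketch, not a recommendation: the heap size, pause target, and application jar name below are placeholders, and the right values depend entirely on your application and must be validated against GC logs.

```shell
# Example JVM flags for GC tuning (JDK 11+ unified logging syntax).
# Heap size, pause goal, and myapp.jar are placeholders.
java \
  -Xms4g -Xmx4g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -Xlog:gc*:file=gc.log \
  -jar myapp.jar

# -Xms/-Xmx            : fixed heap size, avoids resize-related pauses
# -XX:+UseG1GC         : G1, the pause-time-oriented default collector
# -XX:MaxGCPauseMillis : soft pause-time goal for G1
# -Xlog:gc*            : GC logging, used to measure GC throughput
```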

2 Comments
  • Hello Ram Lakshmanan,

     

I don't know the specifics of AWS. Your blog sounds like GC is evil, which it isn't.

    Let's assume how cloud "should" be, for a moment. A cloud provider has tons of physical machines which feed even more virtual machines. A customer application should be fed by a load balancer which always provides the most "fresh" instance to service it.

    If a cloud provider bills you the instances that are on "cool-down" (i.e. GC or other) you should change your cloud provider.

    That's for cloud in particular.

Now one for GC in general. You say that when your instance is GC-ing it does nothing for your customer. That's not true. It cleans up the mess so it can service the next customer better. Like clearing a table in a restaurant for the next patrons. Do you want to eat amidst the mess of the previous patrons?

Your comparison with a car: cars are machines that need service, too. They can't run for months without pause. You need to take into account the different service times of the two worlds: a car trip can take hours, while servicing a customer request can be a matter of minutes or even seconds.

    Your arguments are valid if you assume the very worst conditions. But in reality nothing is eaten as hot as it is cooked.

     

    Regards,

    Manfred Klein

  • "How many millions of dollars enterprises waste due to Garbage collection?" Sorry, this headline is a bit unfair and about as useful as "How much money do you waste breathing."

    1) Memory management is an intrinsic part of every non-trivial program in whatever language you choose. No exception.

    With high-level languages like Java or Python, a GC takes care of that for you, freeing costly developer resources. That means you pay less for development and maintenance and get by with less experienced developers. Which are cheaper and more plentiful, so your talent pool is larger.

    With lower-level languages like C/C++ you do manual memory management and therefore need more experienced developers. You pay much more for them if you can even get them. E.g., good luck hiring a decent kernel developer.

    But even with good devs, you still end up hunting memory leaks, double frees, and buffer overruns, which can become very costly very quickly. That's why business logic is usually not programmed in C.

2) You single out the GC footprint as the big problem. But every layer in the stack down to the raw silicon generates waste and CPU overhead if it does non-trivial memory management. How much space do you think the glibc heap wastes? You'd be surprised.

    Apart from all that, this article does not consider the ongoing developments in modern GCs, e.g., current low-pause GC algorithms like ZGC or Shenandoah.