Pivotal Engineering Journal

Technical articles from Pivotal engineers.

Diagnosing Ruby Memory Issues in Cloud Foundry's API Server

How the Cloud Foundry CAPI team debugged Ruby memory usage issues in the Cloud Controller API

Grafana dashboard displaying abnormal Cloud Controller memory usage in a test environment


Debugging memory issues in software is a notoriously difficult problem. Thankfully, there are already many other excellent blog posts describing techniques for discovering and fixing memory usage issues for Ruby programs. Many of these posts assume, however, that their audience has direct access to the affected environment, and for our particular situation this was sadly not the case. In this post, we will cover how we diagnosed and fixed excessive memory usage for a Ruby web server, all without directly interacting with it. Before we begin our story, here is some background information.


Cloud Controller is the API server that Cloud Foundry users rely on to view and interact with the rest of the system. For instance, whenever a developer uses the Cloud Foundry CLI to deploy their applications or view their applications’ status in Apps Manager, they are ultimately making requests that are served by Cloud Controller. Cloud Controller is a Ruby application and is deployed in data centers throughout the world as part of Cloud Foundry.

Occasionally our end-users will use the platform in ways we might not have predicted, which results in unique and difficult-to-reproduce issues. In this instance, a customer reported that their Cloud Controller instances were consuming excessive amounts of system memory and restarting very frequently. By default, Cloud Controller servers are configured to restart themselves after sustained high memory usage – but these situations are rare and unexpected. Something was clearly wrong.

Cloud Controller instance exhibiting abnormal memory usage: a sawtooth pattern of climbing memory followed by frequent restarts of the server

Understanding the Problem

We did not see this restarting behavior in our own test environments, so reproducing the problem required a better understanding of the customer’s environment and API usage. We began by asking the customer about the size and shape of their data (e.g. number of running apps, spaces, apps per space, etc.) and proceeded to comb through thousands of lines of Cloud Controller logs to get a feel for the type of API requests that were typically made in the environment. Unfortunately, this knowledge alone was not enough to reproduce the issue. We needed to get a better view into what was consuming the memory on their environment. We needed to get a Ruby heap dump!

Collecting Ruby Heap Dumps

By default, Cloud Controller does not trace object allocations or dump its heap, so we needed to get a bit creative. Fortunately, we already had code in place to dump basic diagnostics whenever the USR1 signal is sent to the Cloud Controller process. Taking advantage of this, we constructed a patch for the customer that extended the diagnostics dump to include a heap dump taken before and after a forced garbage collection.
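A simplified sketch of such a patch (the output paths and method names here are assumptions for illustration, not the actual Cloud Controller code):

```ruby
require 'objspace'
require 'fileutils'

# Sketch: dump the heap before and after a forced GC as part of the
# USR1-triggered diagnostics. Paths and names are illustrative only.
def dump_diagnostics(dir = '/tmp/diagnostics')
  FileUtils.mkdir_p(dir)

  # Allocation tracing must be running for file/line/generation metadata
  # to appear in the dump; ideally it is enabled once at process boot.
  ObjectSpace.trace_object_allocations_start

  File.open(File.join(dir, 'heap_dump_before_gc'), 'w') do |f|
    ObjectSpace.dump_all(output: f)
  end

  GC.start # force a full garbage collection

  File.open(File.join(dir, 'heap_dump_after_gc'), 'w') do |f|
    ObjectSpace.dump_all(output: f)
  end
end

# Hook the dump up to the USR1 signal, as Cloud Controller's existing
# diagnostics handler does.
Signal.trap('USR1') { dump_diagnostics }
```

Sending `kill -USR1 <pid>` to the process then writes the two dump files. Note that allocation tracing only annotates objects allocated after it is enabled, so enabling it at boot yields the most complete metadata, at the cost of some runtime overhead.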

Understanding Ruby Heap Dumps

A heap dump shows all of the objects that exist in the Ruby heap when the dump is generated. These objects are tagged with useful information like when and where they were allocated, and what objects have pointers to them. Time is divided into a series of “generations” – the periods between garbage collector runs. These generations do not directly correspond to clock time, but do give a sense of when objects were created relative to other objects in memory.

There is a lot of raw data in heap dumps, but they are not very easy for humans to understand. To help reason about what objects were sticking around at the time of the heap dump, we used the gem heapy. Heapy processes the heap dump and groups objects by creation generation, allocation location, and storage location. This makes it easy to see when and where the objects in the heap were allocated.

For a typical healthy Ruby program, we would expect most of the objects in the heap dump to be allocated either during the early or late generations. Objects from early generations are allocated when the program first starts and include core Ruby classes (remember that Ruby classes are objects) or classes used by the web server across multiple requests. If there are many objects persisting from intermediate generations, this can be a sign that something is going wrong. That something could be a backed up queue of long-lived requests, a memory leak, or another issue entirely.

Our heaps looked similar to this:
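(The counts below are illustrative of the pattern we saw, not the customer's actual numbers.)

```
$ heapy read heap_dump_after_gc

Analyzing Heap
==============
Generation: nil object count: 208923
Generation:  36 object count: 1241
Generation:  78 object count: 3278
...
Generation: 124 object count: 512
Generation: 125 object count: 857438
Generation: 126 object count: 487
...
```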

Notice how generation 125, a middle generation, looks abnormally large in comparison to its neighbors. Using heapy we can dive in further and see where the objects using this memory were allocated.

Sifting Through the Dump

With the help of our friend heapy, we were able to get a sense of where all of these objects were coming from. The allocation point for our leaked objects was where our ORM loaded objects out of the database. Unfortunately, most of what we do is load ORM models out of our database, so this was not immediately helpful. Since classes in Ruby are also objects, we were able to write a script to count the number of instances of different classes in the heap dump.

421453 "0x5642f6a6fa88"
211200 "0x5642f6a8fe00"
210912 "0x5642f6a7e240"
210687 "0x5642f933aa80"
  70 "0x5642f6a7ffa0"
  20 "0x5642fbb3c920"
  20 "0x5642fb5bf2b8"
   2 "0x5642fab0caa0"
   2 "0x5642faaebf80"
   2 "0x5642fa9f18c8"
   2 "0x5642fa9c6b00"
   2 "0x5642fa8971a8"
   2 "0x5642f789add8"
   2 "0x5642f6f69020"
   2 "0x5642f6d036c8"
   2 "0x5642f6a6d0d0"
   2 "0x5642f6a6c888"
   1 "0x5642fbb3db68"
   1 "0x5642fa82c240"

Once we had frequencies of certain objects, we focused on one of the objects with an extremely high frequency, such as 0x5642f933aa80. We needed to find out the class of this object, so we searched through the raw dump again using ag (the Silver Searcher).

$ ag --mmap 0x5642f933aa80 heap_dump_after_gc | grep '"type":"CLASS"'

232204:{"address":"0x5642f933aa58", "type":"CLASS", "class":"0x5642f9263328", "name":"Class", "references":["0x5642f925b678", "0x5642f933aa80", "0x5642f737e638"], "file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/app/models/runtime/user.rb", "line":2, "generation":72, "memsize":504, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}

232205:{"address":"0x5642f933aa80", "type":"CLASS", "class":"0x5642f933aa58", "name":"VCAP::CloudController::User", "references":["0x5642f92812d8" ... "0x5642f930d670"], "file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/app/models/runtime/user.rb", "line":2, "generation":72, "memsize":5864, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}

Based on the output above, we saw that 0x5642f933aa80 corresponded to the VCAP::CloudController::User class object. We could now identify the object!

210687 "0x5642f933aa80" ----> VCAP::CloudController::User

Tracing the Referenced Objects

Once we knew the identities of the most frequently occurring objects, we just needed to figure out where they were coming from. The relative quantities of different types of objects might be a clue: if the heap is full of a particular resource, maybe it is the list endpoint for that resource, or another endpoint that loads it. This was not enough information in our case, so we looked at some instances of the leaked classes and traced up their memory addresses to see what objects were holding on to them so tightly.

Every time we searched for a given object address, we also found out its class using the same process as earlier. For example:

$ ag --mmap 'class":"0x5642f933aa80' heap_dump_after_gc | head

11617:{"address":"0x5642f6bc5ef0", "type":"OBJECT", "class":"0x5642f933aa80", "ivars":3, "references":["0x7fd083e28028"], "file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.4.0/gems/sequel-4.49.0/lib/sequel/model/base.rb", "line":264, "method":"allocate", "generation":125, "memsize":40, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}

Above we found that 0x5642f6bc5ef0 was the address of an instance of a frequently occurring class. Time to find where this instance was referenced:

$ ag --mmap 0x5642f6bc5ef0 heap_dump_after_gc | less

11617:{"address":"0x5642f6bc5ef0", "type":"OBJECT", "class":"0x5642f933aa80", "ivars":3, "references":["0x7fd083e28028"], "file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.4.0/gems/sequel-4.49.0/lib/sequel/model/base.rb", "line":264, "method":"allocate", "generation":125, "memsize":40, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}

2300469:{"address":"0x7fd08762f980", "type":"ARRAY", "class":"0x5642f6a7ffa0", "length":10002, "references":["0x5642f6bc5ef0", ...], "file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.4.0/gems/sequel-4.49.0/lib/sequel/dataset/actions.rb", "line":1051, "method":"_all", "generation":125, "memsize":89712, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}

In this example we saw that the User instance was a member of a large array, but we did not know what object owned that array. We continued this process by searching for the address of the array.

$ ag --mmap 0x7fd08762f980 heap_dump_after_gc | less

2300484:{"address":"0x7fd08762fbd8", "type":"HASH", "class":"0x5642f6a7e240", "size":1, "references":["0x7fd08762f980"], "file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.4.0/gems/sequel-4.49.0/lib/sequel/model/associations.rb", "line":2248, "method":"associations", "generation":125, "memsize":192, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}

Now we had found that our array was referenced by a hash, so we continued the process and found what object referenced this hash.

$ ag --mmap 0x7fd08762fbd8 heap_dump_after_gc | less

4376456:{"address":"0x7fd0e036d888", "type":"OBJECT", "class":"0x5642fae5e2d0", "ivars":3, "references":["0x7fd0e036d978", "0x7fd08762fbd8"], "file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.4.0/gems/sequel-4.49.0/lib/sequel/model/base.rb", "line":264, "method":"allocate", "generation":122, "memsize":40, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}

We had finally reached a new, hopefully more interesting object. As before, we ran the following to find out what its class was:

$ ag --mmap 0x5642fae5e2d0 heap_dump_after_gc | grep '"type":"CLASS"'

371974:{"address":"0x5642fae5e2d0", "type":"CLASS", "class":"0x5642fae5e2a8", "name":"VCAP::CloudController::Space", "references":["0x5642fb1f9db8" ... "0x5642faeb0c38"], "file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/app/models/runtime/space.rb", "line":4, "generation":72, "memsize":7960, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}

It was a VCAP::CloudController::Space class! This turned out to be the most interesting class we found while walking up the object graph, but we didn't know that at the time so we continued the process above until we reached the root reference (a Thread object).
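This manual loop of searching for addresses can also be automated. Here is a small sketch (not the team's actual tooling) that indexes referrers from the dump's JSON lines and then follows one referrer chain upward from a leaked object's address:

```ruby
require 'json'

# Build a reverse index: for each address, which objects reference it?
def build_referrer_index(dump_path)
  referrers = Hash.new { |h, k| h[k] = [] }
  File.foreach(dump_path) do |line|
    obj = JSON.parse(line) rescue next
    next unless obj['address'] && obj['references']
    obj['references'].each { |ref| referrers[ref] << obj['address'] }
  end
  referrers
end

# Walk upward from an address toward the root, following one referrer at
# each step (real dumps can branch; this takes the first parent found).
def trace_up(referrers, address, depth = 10)
  chain = [address]
  depth.times do
    parents = referrers[chain.last]
    break if parents.empty?
    chain << parents.first
  end
  chain
end
```

Running `trace_up` on a leaked User instance's address would reproduce the User → ARRAY → HASH → Space chain we walked by hand above.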

Since we found this VCAP::CloudController::Space, however, we were able to start thinking about what endpoint would load a Space whose associations hash contained an array of Users.


After tracing the “leaked” objects all the way to the top, we found the EventMachine thread that holds on to all requests. Short of intentionally writing a bunch of objects to Thread.current, the only explanation for this was that the requests were still in progress. The problem was not the result of a memory leak after all! With an understanding of the shape of the data, we were able to identify the problem endpoint. This endpoint needed to check whether a User belonged to a Space, so it searched through an array of in-memory User objects in Ruby rather than executing a simple SQL query. Worse yet, the array of User objects was reloaded and searched through for each of the API user's applications. This meant that an individual with access to many applications would trigger the same expensive operation for each application.

We learned that the environment that was exhibiting this behavior was a “sandbox” environment that gave all developers access to all other applications. To compound the problem, the environment also had an API consumer that was polling and fetching the entire list of applications every 30 seconds. Once enough of these requests stacked up, the Cloud Controller was no longer able to serve requests and eventually hit its memory quota, triggering a restart. We replaced the offending line with a SQL query and everything recovered. Like any good bug, the issue came down to fixing a few lines of code.


Over the course of our investigation, we came up with some takeaways that we believe would help any team trying to diagnose a Ruby memory leak:

  • Take advantage of the tools: Heap dumps make sense to computers, but not to humans. Using tools like heapy allowed us to make sense of a heap dump even though we are not robots.
  • Do not ONLY test as admin: One of the original reasons our initial investigation did not reproduce the problem was because we were using an admin account. The code path exhibiting the memory issue was only exercised by non-admin users.
  • Be wary of your ORM shielding you from object allocation: One of the strengths of using an ORM is that it shields you from having to know the specifics of which DB operations are occurring. In our case, it was not obvious whether a line of code was loading the entire contents of a table into memory and then filtering vs. filtering with SQL and only loading the results. It is often worth taking a peek at the SQL statements generated by the ORM to see if they match your expectations.
  • Offload as much as you can to the DB: A surefire way to avoid consuming memory in your application is to not load it in the first place. DBs are really good at filtering data before returning it to you!
  • Test extreme scenarios: Even in our non-admin testing and other production environments, we had not seen the memory problem. It’s worth doing at least some spot tests where you ramp up the quantity of a specific resource and run your performance tests.
  • There IS an answer: It’s easy to get discouraged over a long-running investigation, so take heart! You will figure it out eventually.