Discussion on: Tell me a bug story

I work on a Ruby API that serves as a core piece of the control plane for an open-source PaaS called Cloud Foundry. Our users install and operate the platform on the infrastructure of their choice (on-prem vSphere, AWS, GCP, Azure, etc.) and everyone tends to use it a little bit differently. This leads to lots of possible configurations and makes certain types of bugs hard to triage and even harder to reproduce.

One bug (or unforeseen usage pattern) we had seems really obvious in hindsight, but ended up taking weeks of investigation. We had some users report that their API processes were consuming huge amounts of memory: every six minutes they would reach ~8 GB of RAM usage and restart. Now Ruby isn't the most lightweight programming language, but it shouldn't be that bad! We initially suspected a bad memory leak, but pausing the interpreter and manually forcing garbage collection freed up most of the memory, so the objects weren't actually being leaked. So we ended up crawling through heap dumps (wrote a blog post on this process with my team) and eventually found out there were tons and tons of User model objects in memory.
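For anyone who wants to try this kind of inspection on a small scale, here is a minimal sketch that counts live instances of a class from inside the process. It is not the tooling described above: the `User` class and the `live_instance_counts` helper are hypothetical stand-ins, and only `ObjectSpace.each_object` is a real Ruby API.

```ruby
# Hypothetical stand-in for the ORM model class from the story.
class User
  attr_reader :guid

  def initialize(guid)
    @guid = guid
  end
end

# Hypothetical helper: count live instances of each given class.
# This is a rough in-process view of what a heap dump would show.
def live_instance_counts(*classes)
  classes.to_h { |klass| [klass.name, ObjectSpace.each_object(klass).count] }
end

# Simulate a request that keeps a large batch of model objects reachable.
retained = Array.new(10_000) { |i| User.new("user-#{i}") }

counts = live_instance_counts(User)
puts counts.inspect
```

A count that balloons during requests and drops again after `GC.start` points at heavy allocation rather than a true leak. For a full dump, the `objspace` extension's `ObjectSpace.dump_all` writes every live object out as JSON for offline analysis.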

Turns out that this installation had all of its users (10,000+) belonging to the same organizational unit (called a Space), and we had a frequently accessed line of code that was loading this full array of users into memory every time an API endpoint was hit. It was simply trying to do an existence check to see if a particular user was a member of the Space in question, but because of how we used our ORM it was instantiating and loading all users within that Space into memory. Since our test environments (and many other production environments) tend to only put dozens or hundreds of users in a Space, we hadn't encountered this.

The fix ended up being super simple:
Do the existence check in SQL instead of in Ruby

😂
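To make the difference concrete, here is a plain-Ruby simulation of the two approaches. It uses no real ORM or database: `SpaceMembers` is a toy stand-in for a table, the SQL in the comments uses hypothetical table and column names, and the instantiation counter exists only to show how many model objects each approach builds.

```ruby
require "set"

# Hypothetical model that counts how many instances get built.
class User
  @instantiated = 0

  class << self
    attr_accessor :instantiated
  end

  attr_reader :guid

  def initialize(guid)
    @guid = guid
    self.class.instantiated += 1
  end
end

# Toy stand-in for a database table of Space memberships (not a real ORM).
class SpaceMembers
  def initialize(guids)
    @rows = guids.to_set # raw "rows" as the data store holds them
  end

  # Like loading an association: builds one model object per row.
  # Roughly: SELECT * FROM users WHERE space_id = ?
  def all
    @rows.map { |guid| User.new(guid) }
  end

  # Like an existence query: the data store answers, no models are built.
  # Roughly: SELECT 1 FROM users WHERE space_id = ? AND guid = ? LIMIT 1
  def exists?(guid)
    @rows.include?(guid)
  end
end

members = SpaceMembers.new((1..10_000).map { |i| "user-#{i}" })

# Before: existence check in Ruby, after materializing every member.
User.instantiated = 0
slow_result = members.all.any? { |user| user.guid == "user-9999" }
slow_instantiated = User.instantiated

# After: existence check pushed down to the data store.
User.instantiated = 0
fast_result = members.exists?("user-9999")
fast_instantiated = User.instantiated

puts "in Ruby: found=#{slow_result}, objects built=#{slow_instantiated}"
puts "in SQL:  found=#{fast_result}, objects built=#{fast_instantiated}"
```

Both checks return the same answer, but the first allocates one object per member on every call. With a real ORM the equivalent is usually a query-level call (for example ActiveRecord's `exists?`) rather than an array method on a loaded association.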