Alessandro Dolci for Zanichelli Editore

Investigating Redis memory leaks due to Laravel Cache Tags 🔎

tl;dr ⏰

We noticed that one of our production services was experiencing cache cluttering issues for no apparent reason. Leveraging AWS observability tools and a few targeted Redis queries, we found that the issue lay in how the key tagging feature of Laravel Cache works, in particular when using a Redis backend.
To keep track of every key associated with a specific tag, Laravel maintains a Redis set behind the scenes, and that set can grow indefinitely, potentially exhausting memory. You can verify this by simply observing how your key space evolves as you store tagged keys (even ones with a TTL) through the Cache interface.
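
As a minimal sketch of what we mean (assuming a bootstrapped Laravel app with a Redis cache store; the tag, key, and connection names below are made up, and the exact Redis key layout depends on your cache prefix and Laravel version):

use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Redis;

// Store a tagged value with a 60-second TTL through the Cache facade.
Cache::tags(['demo'])->put('demo-key', 'some value', 60);

// Listing what actually ended up in Redis shows, besides the cached
// value itself, an extra set that tracks the keys stored under the
// 'demo' tag. The value disappears after 60 seconds; the reference
// to it inside the set does not.
// KEYS is used here for illustration only -- don't run it against a
// busy production node.
foreach (Redis::connection('cache')->command('keys', ['*']) as $key) {
    echo $key, PHP_EOL;
}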

Who we are 🏢

I’m currently part of the software development team at Zanichelli Editore, one of the leading publishing and education companies in Italy. Our focus is on edtech products that support teachers' and students' activities both at school and at university.

A good number of the web services we maintain are based on Laravel, which we've found to be particularly well suited to our use cases and, in general, a good choice for bootstrapping new projects quickly and easily. Having worked with the framework for some years now, we've also come across situations where it started to feel constraining, and we've had the chance to learn some of its inner workings and to collect a number of gotchas along the way.

Noticing the issue 🚩

Let’s begin with a bit of context. Our identity provider service¹, which is responsible for authenticating the users of all our applications, makes extensive use of a Redis cache to store session-related data. This is very practical because, for example, it allows us to have our users' sessions invalidated automatically just by setting TTLs on the keys we save (by requirement, these sessions must live no longer than a handful of days).
More generally, with the exception of a few persistent keys that we deliberately store without an expiration, we try to stick to volatile-only keys on our Redis instances.
As a consequence, we expect that, over a period in which our services don't experience significant variations in load, pressure on cache storage should remain roughly constant as well.
All our cache servers are managed AWS ElastiCache Redis clusters. On CloudWatch (AWS's observability service) we've put together some charts to help us monitor the relevant metrics. One of them shows how memory usage evolves over time on the nodes. This is what it should look like under normal circumstances:

Memory usage percentage under normal circumstances

As you can see, our services undergo substantially higher load during workdays, as our audience is made up entirely of teachers and students and the apps are mainly used during school lessons. Starting from an empty node, memory usage ramps up until it reaches the point at which keys start to expire (that is, two weeks in).

What we were seeing instead was a trend in which memory usage kept increasing, over a period that had grown longer than what we would normally have expected. Moreover, at the same time our analytics and tracking data suggested that the systems weren't experiencing particularly odd traffic patterns.
Something was definitely off.

Memory usage percentage as observed

Contrary to what we can see in the first chart, even after the first two weeks go by, usage keeps increasing (albeit at a slower pace), as if some of the keys were not expiring as expected.

Digging deeper 🔎

Our first thought after seeing this was to check every service that was storing keys on the involved node, looking for some operation that wasn't setting an explicit TTL. After all, having excluded traffic spikes from the equation, this was the only remaining option.
We started analyzing the codebase to spot evidence of this, but couldn't find anything worth digging deeper into. All of our explicit key storage calls were correctly specifying an expiration time.

Analyzing the key space

So, as a next step, we connected to the affected Redis node through redis-cli and started poking around, issuing some basic queries to try to get some insight.
We tried scanning for keys that did not have a TTL with a shell script that looked like this:

# Iterate over the whole key space with SCAN (non-blocking), then print
# every key whose TTL is negative, i.e. keys without an expiration.
redis-cli --scan | while read -r LINE; do
    TTL=$(redis-cli TTL "$LINE")
    if [ "$TTL" -lt 0 ]; then
        echo "$LINE"
    fi
done

The approach is pretty straightforward and, despite its brute-force nature, it should have gotten us where we needed. Too bad the loop becomes quite slow when the node holds a relatively large amount of data. The SCAN-based iteration takes a considerable amount of time to complete, but it lets us avoid blocking our production nodes for the entire duration of the query. For each of the extracted keys, we then run another read operation on the server to output its TTL value.
We let the script run for quite some time, and we did in fact discover some keys without an expiration: a handful of the external dependencies we use were caching configuration and internal data indefinitely. It was something, but the number of these entries was negligible compared to what we were expecting.
At this point we were running a bit short of ideas, so we took a step back to the Redis documentation. After a while, we stumbled upon what seemed to be a promising feature.

Scanning for big keys

It’s no surprise that Redis receives so much love from developers. Being simple to grasp and use is certainly its biggest selling point, but we find its documentation and tooling also worth mentioning for their effectiveness.
When we read about the --bigkeys option, we immediately thought it was worth a shot. Until then, we had only considered the possibility of a large pool of never-expiring keys, but what if we should instead have been searching for a handful of well-hidden large ones?
What this redis-cli option does is loop over the entire key space (again via SCAN), updating its output in real time with the largest keys found so far. It is a slow operation as well, but we had nothing better to try, so we launched it and waited.
After some time, a key accounting for roughly 80% of the used space popped out. More specifically, it was a set containing almost 4 million items; we had no idea what it was, nor had we noticed it during the previous search (a quick way to double-check the size of a key like this is shown below). Its name was something along these lines: prefix:612d0a5216572240349046:standard_ref.
A quick search on the web revealed that it had to do with Laravel Cache Tags, a feature we couldn't remember ever having used.
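
Incidentally, checking the size of a set like this is cheap, since SCARD runs in constant time. A rough sketch of how to do it from inside the application (assuming your Redis connection applies no extra key prefix of its own):

use Illuminate\Support\Facades\Redis;

// SCARD returns the number of members without iterating the set,
// so it does not block the node even when the set is very large.
$members = Redis::connection('cache')->command('scard', [
    'prefix:612d0a5216572240349046:standard_ref',
]);

echo "The tag reference set holds {$members} entries", PHP_EOL;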

Cache tags

The idea, at least in the way Laravel implements it, is that you can assign one or more tags to keys at the time you store them, so that you can easily flush all the keys associated with given tags when necessary. Once you store a key using tags, you must always specify them to operate on that key (a bit weird, isn't it?).
To maintain a reference to the keys of a given tag, the framework uses a Redis set in which it stores the name of each key. The flaw lies in the fact that, while the keys themselves may have been given a TTL, their references inside the set are there to stay. This can lead to considerable leaks of used space, especially when you don't even know that some component of your application uses the feature.
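
For reference, this is roughly what the tags API looks like from the application side (tag and key names below are made up):

use Illuminate\Support\Facades\Cache;

// Store a value under a tag, with a 10-minute TTL.
Cache::tags(['sessions'])->put('session:alice', ['user_id' => 42], 600);

// Reading it back requires the same tags: a plain
// Cache::get('session:alice') will not find the entry.
$session = Cache::tags(['sessions'])->get('session:alice');

// Flushing by tag removes every key associated with that tag.
Cache::tags(['sessions'])->flush();

// What never happens automatically, though, is the clean-up of the
// per-tag reference set: when 'session:alice' simply expires, its
// name stays in the set Laravel keeps for the 'sessions' tag.
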
How did this happen without us even knowing? By taking a look at the keys referenced in the set, we understood that they were being saved by a third-party library that needs to maintain a deny-list for its internal functioning². We knew about the list, but we had no idea that cache tags were involved.

Solution and final considerations 🟢

We had finally solved the mystery and gotten a grasp of this weird tagging feature. By examining the library code, it became clear that there was no need for it to tag the items it stored (the library didn't even contain code that made use of the flush-by-tag feature!). Fortunately, cache access was designed with a provider-based pattern, which let us replace the default component with one that accesses the cache directly, without specifying tags. Our cache was finally back to normal!
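
As a rough illustration of the kind of replacement we mean (the class, method names, and prefix below are hypothetical, since the actual interface to implement depends on the library):

use Illuminate\Contracts\Cache\Repository as Cache;

// Hypothetical storage provider that mirrors the behaviour of the
// tag-based one, but writes keys directly: entries expire on their
// own and leave no tag reference set behind in Redis.
class UntaggedDenyListStorage
{
    public function __construct(
        private Cache $cache,
        private string $prefix = 'deny-list:'
    ) {
    }

    public function add(string $tokenId, int $minutes): void
    {
        // A plain put() with a TTL (in seconds): the key expires on
        // its own and nothing else is written.
        $this->cache->put($this->prefix.$tokenId, true, $minutes * 60);
    }

    public function has(string $tokenId): bool
    {
        return $this->cache->has($this->prefix.$tokenId);
    }

    public function destroy(string $tokenId): void
    {
        $this->cache->forget($this->prefix.$tokenId);
    }
}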

Now, I hear what you’re saying: “hey, this must be a bug in the tags implementation!”. Well, that's what we initially thought too, but we soon realized that the real problem lies in the fact that Redis does not offer any automatic expiration for individual set members, making this kind of pitfall basically impossible to avoid at that level.
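
To make that concrete, expiration in Redis is a property of a whole key, never of an individual set member (key and member names below are arbitrary):

use Illuminate\Support\Facades\Redis;

// EXPIRE applies to the set as a whole; there is no way to attach a
// TTL to 'member-a' or 'member-b' individually.
Redis::connection()->command('sadd', ['example-set', 'member-a', 'member-b']);
Redis::connection()->command('expire', ['example-set', 60]);
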
Since we analyzed this issue, the Laravel authors seem to have become more aware of it: the upgrade guide for the latest version of the framework now contains a notice warning users about the behavior, and a schedulable command that removes stale references has been made available to guard against cache cluttering.
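
If we read the upgrade notes correctly, on recent framework versions that clean-up boils down to scheduling the cache:prune-stale-tags Artisan command, along these lines (sketched for a classic app/Console/Kernel.php setup):

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule): void
    {
        // Periodically drops, from the per-tag sets of the Redis cache
        // store, the references whose entries have already expired.
        $schedule->command('cache:prune-stale-tags')->hourly();
    }
}
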
Other alternatives exist for tagged cache functionality. Alternative Laravel Cache is another implementation that tries to address some of the quirks of the default tags API. Despite being based on the php-cache libraries and sharing little code with Laravel's default implementation, it still seems to suffer from the issue we just described.

Conclusion 🏁

In this article we illustrated our experience debugging cache issues in a core production service. We examined the concept of cache tags, how the Laravel framework makes them available, and what issues can arise from improper usage.

After having dealt with all of this, our opinion of cache tags is not particularly favorable. Generally speaking, they seem to add limited value for the cost they bring in usage and maintenance. They certainly have their use cases, but make sure you know about the limitations we just discussed before reaching for them.

Thank you for your interest, we hope you enjoyed the read. Make sure to keep in touch with us for more articles like this!

References


  1. You can find a stripped-down version of the service we run in production here

  2. It's the library we use to handle user tokens (GitHub page). The list stores references to tokens that have been invalidated. 
