We Added a Cache; Three Days Later It Took the Database Down at Peak

#rails #performance #caching #redis

Rails Performance: Lessons from Production — #5 · Caching, part 2

The previous post covered how caching works (the four layers). This one is about how it bites — the failure modes you only meet after it ships. Adding a cache is one line of fetch, but what it buys you is a whole new class of problems: hammering the DB the instant a key expires, logic breaking when the cache vanishes, memory blowing up. Same example throughout (a shipments table).

🐎 Trap 1: the instant it expires, everyone hammers the DB at once (stampede)

We cached the homepage's 800ms ranking stat with expires_in: 5.minutes. The homepage flew. Felt great.

Three days later, at peak hour, DB CPU suddenly pegged at 100% and the whole site slowed down. The APM showed a spike every 5 minutes — like clockwork.

The cause was a cache stampede: the instant that key expired, all the requests arriving at that moment missed simultaneously and all ran that same 800ms query at once. The cache was supposed to protect the DB; instead, "everyone expiring at the same instant" made the DB take a huge spike in that one second.

How big a spike? It's not about raw RPS — it's RPS × recompute time. The query takes 800ms; during that 0.8s "miss window," every arriving request finds nothing and recomputes. Even a modest 50 RPS means ~40 requests running the same 800ms query in parallel — the DB starts gasping.
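
To make that arithmetic concrete, here is a back-of-the-envelope sketch. The helper name stampede_size is made up for this post, and it assumes traffic arrives evenly, which real traffic never quite does, so treat the result as a rough floor rather than a prediction.

```ruby
# Rough estimate: how many requests end up running the same query in
# parallel while the first recompute is still in flight.
# (Hypothetical helper; assumes evenly spread traffic.)
def stampede_size(requests_per_second, recompute_ms)
  (requests_per_second * recompute_ms / 1000.0).ceil
end

puts stampede_size(50, 800)  # the numbers from this post: 50 RPS, 800ms query
```

Double the traffic or double the query time and the pile-up doubles with it, which is why the slowest queries are both the ones most worth caching and the ones that hurt most when the cache expires.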

How to fix:

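Don't give a pile of keys the same hard expiry. Switch to key-based expiration, where the key embeds a version of the data (a row count or the latest updated_at, say), so an entry is replaced when the data changes rather than when a timer fires, and there's no "everyone expires at the same instant" time bomb.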
If you genuinely need time-based expiry, use Rails' race_condition_ttl:

Rails.cache.fetch("courier_ranking", expires_in: 5.minutes, race_condition_ttl: 10.seconds) do
  Shipment.group(:courier_id).count
end

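After expiry, the first request to find the expired entry extends its life by race_condition_ttl and then recomputes; requests arriving in the meantime get the just-expired old value immediately and return without waiting. Once the first finishes (~0.8s), the new value lands. race_condition_ttl: 10.seconds isn't "everyone waits 10s": it's the maximum grace period the old value is kept alive in case the recompute drags. One caveat: the old value is only reused if it expired within the last 10 seconds, so this protects busy keys, which are exactly the ones that stampede.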
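
If that behavior is hard to picture, the toy model below may help. It is a deliberately simplified, single-process imitation of the idea, not Rails' actual implementation: ToyCache, its injected clock, and the demo method are all invented for illustration, and the "second request" is simulated by calling fetch again from inside the recompute block.

```ruby
# Toy model of "serve the stale value while one request recomputes".
# It imitates the idea behind race_condition_ttl; it is not Rails' code.
class ToyCache
  Entry = Struct.new(:value, :expires_at)

  # clock: any callable returning the current time in seconds (injected so
  # the demo can move time forward by hand).
  def initialize(clock)
    @clock = clock
    @entries = {}
  end

  def fetch(key, expires_in:, race_condition_ttl:)
    now = @clock.call
    entry = @entries[key]
    return entry.value if entry && now < entry.expires_at # fresh hit

    if entry && now - entry.expires_at <= race_condition_ttl
      # Recently expired: extend the stale entry so that requests arriving
      # during the recompute below still see a hit.
      entry.expires_at = now + race_condition_ttl
    end

    value = yield # only this caller pays for the recompute
    @entries[key] = Entry.new(value, @clock.call + expires_in)
    value
  end
end

def demo_stale_while_recompute
  now = 0.0
  cache = ToyCache.new(-> { now })
  opts = { expires_in: 300, race_condition_ttl: 10 }
  recomputes = 0
  seen_mid_recompute = nil

  cache.fetch("courier_ranking", **opts) { recomputes += 1; "v1" }

  now = 301.0 # one second after the 5-minute expiry
  fresh = cache.fetch("courier_ranking", **opts) do
    recomputes += 1
    now += 0.8 # the 800ms query is running...
    # ...and a second request arrives before it finishes:
    seen_mid_recompute = cache.fetch("courier_ranking", **opts) do
      recomputes += 1
      "never computed"
    end
    "v2"
  end

  { fresh: fresh, seen_mid_recompute: seen_mid_recompute, recomputes: recomputes }
end

p demo_stale_while_recompute
```

In this run the request that arrives mid-recompute gets "v1" straight away, the recomputing request gets "v2", and the block runs twice in total (the initial fill plus one refresh), never a third time.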

Lesson: a cache protects the DB, but "caches expiring together" concentrates traffic onto the DB. When you add a cache, plan for the instant it expires.
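
For the key-based option in the fix list, a minimal sketch of what "tied to the data version" could look like. The helper versioned_key is hypothetical; in the app its inputs would presumably come from something like Shipment.count and Shipment.maximum(:updated_at), which assumes the shipments table has an updated_at column and that those two queries are cheap.

```ruby
# Hypothetical helper: the key changes whenever the data changes, so a stale
# entry is simply never asked for again instead of expiring on a timer.
def versioned_key(name, row_count, latest_change)
  stamp = latest_change.utc.strftime("%Y%m%d%H%M%S%6N") # microsecond precision
  "#{name}/#{row_count}-#{stamp}"
end

before = Time.utc(2024, 1, 1, 12, 0, 0)
after  = before + 60 # a shipment was updated one minute later

puts versioned_key("courier_ranking", 1_000, before)
puts versioned_key("courier_ranking", 1_000, after)
```

The trade-off is that old versions linger in the cache until something evicts them, and on a table that changes every few seconds the key changes just as often, so this suits data that is read far more than it is written.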

🧱 Trap 2: treating the cache as the source of truth — Redis restarts and it crashes

Some logic was written like this — write a setting into the cache, then read straight back from the cache later:

Rails.cache.write("active_courier_ids", Courier.active.pluck(:id))
# ... elsewhere ...
ids = Rails.cache.read("active_courier_ids")
ids.each { |id| ... }   # one day this blew up: ids was nil

A Redis maintenance restart wiped the cache. That read returned nil, nil.each raised NoMethodError, and the feature died.

The root issue: we treated the cache as the "home" of the data. But a cache can disappear at any moment — Redis restarts, memory fills and entries get evicted, keys expire. It's not a database.

How to fix: a cache can only ever be a "recomputable copy," never the source of truth. If it's missing, you must be able to recompute it — use fetch (recomputes on a miss) instead of a bare read:

ids = Rails.cache.fetch("active_courier_ids") { Courier.active.pluck(:id) }

Lesson: always assume the cache will be gone the next second. Any code that "breaks if the cache is missing" is a landmine.

💣 Trap 3: putting something variable in the key — Redis memory blows up

To cache each user's search results, someone set the key like this:

Rails.cache.fetch("search/#{params[:q]}") { expensive_search(params[:q]) }

Looks reasonable. But params[:q] is free user input — every distinct search string spawns a new key. Once live, all sorts of bizarre query strings poured in, the number of keys grew without bound, Redis memory climbed until it blew up, and the whole cache service went down.

The problem: the key's cardinality is out of control — you think you're caching, but you're actually hoarding without limit.

How to fix:

Build keys only from bounded, controlled dimensions (category, id), never drop raw user input straight into the key.
Give Redis a memory limit (maxmemory) and an eviction policy (maxmemory-policy); the policy only kicks in once a limit is set. Don't leave the policy at the default noeviction (which rejects writes with an error when full). For caches, allkeys-lru is common (drop the least-recently-used when full), but the point isn't which one. The point is having a "drop the old when full" policy at all, so it doesn't blow up entirely.
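
As a sketch of the first point, one way to keep key cardinality bounded is to build the key through a helper that only accepts known dimensions. The category list and the helper name below are invented for illustration; the real dimensions depend on what the search actually filters on.

```ruby
# Hypothetical allowlist: the only values permitted in the key.
ALLOWED_CATEGORIES = %w[express standard freight].freeze

# Builds a search cache key from bounded dimensions only. The number of
# possible keys is at most (categories x couriers), never "whatever users type".
def bounded_search_key(category, courier_id)
  unless ALLOWED_CATEGORIES.include?(category)
    raise ArgumentError, "unknown category: #{category.inspect}"
  end

  "search/#{category}/#{Integer(courier_id)}" # Integer() rejects non-numeric ids
end

puts bounded_search_key("express", "42")
```

Free-text queries that don't map onto these dimensions would then skip the cache and go straight to the search, which is slower per request but can't fill Redis with one-off keys.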

Lesson: cache space is finite. Every key costs memory; a key that grows without bound is a memory leak.

☠️ Trap 4: caching the failed result

The ranking stat was changed to come from an internal API:

Rails.cache.fetch("courier_ranking", expires_in: 5.minutes) do
  RankingApi.fetch   # one time it returned nil (the API was briefly down)
end

That time the API briefly failed and returned nil, and fetch stored that nil as "the result" for 5 minutes. The API recovered within seconds, but for the next 5 minutes every user still got an empty ranking: a few-second outage amplified by the cache into a 5-minute site-wide failure.

How to fix: don't cache failed / empty results. If you get an abnormal value, don't store it (or let it expire immediately):

Rails.cache.fetch("courier_ranking", expires_in: 5.minutes) do
  result = RankingApi.fetch
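  # Raising aborts the block, so fetch never writes; the caller must rescue this.
  raise "empty ranking from RankingApi" if result.blank?   # don't let an empty result get stored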
  result
end
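
If raising is too blunt for the caller, another option is to split the fetch into read and write and only write good results. The sketch below is one possible shape, not an official Rails pattern: fetch_unless_empty and empty_result? are hypothetical helpers, and FakeCache is a throwaway stand-in (it ignores expires_in) so the example runs without Rails. The helper only relies on read(key) and write(key, value, expires_in:), which Rails.cache also provides.

```ruby
# Throwaway stand-in for Rails.cache so this runs on its own.
# It ignores expires_in; a real store would honor it.
class FakeCache
  def initialize
    @data = {}
  end

  def read(key)
    @data[key]
  end

  def write(key, value, expires_in: nil)
    @data[key] = value
  end
end

# Plain-Ruby approximation of ActiveSupport's blank? for nil and collections.
def empty_result?(value)
  value.nil? || (value.respond_to?(:empty?) && value.empty?)
end

# Returns the cached value if present; otherwise runs the block and stores
# the result only when it is non-empty. An empty result is returned as-is,
# so the caller can show a fallback, and the next request tries again.
def fetch_unless_empty(cache, key, expires_in:)
  cached = cache.read(key)
  return cached unless cached.nil?

  result = yield
  cache.write(key, result, expires_in: expires_in) unless empty_result?(result)
  result
end

def demo_fetch_unless_empty
  cache = FakeCache.new
  api_responses = [nil, ["courier 7", "courier 3"]] # outage, then recovery
  api_calls = 0
  call_api = lambda do
    api_calls += 1
    api_responses.shift
  end

  first  = fetch_unless_empty(cache, "courier_ranking", expires_in: 300, &call_api)
  second = fetch_unless_empty(cache, "courier_ranking", expires_in: 300, &call_api)
  third  = fetch_unless_empty(cache, "courier_ranking", expires_in: 300, &call_api)
  { first: first, second: second, third: third, api_calls: api_calls }
end

p demo_fetch_unless_empty
```

Here the outage response is returned but never stored, the very next request reaches the recovered API, and the third is served from the cache. Note that this read-then-write version gives up fetch's race_condition_ttl handling. For the nil-only case, Rails versions that support it also offer skip_nil: true on fetch, which keeps everything in a single call.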

Lesson: a cache faithfully preserves your errors too. Before storing, confirm this is a result worth keeping for 5 minutes.

🏁 Wrap-up: the cost of caching

Caching makes a system faster, but it buys a whole new class of problems. These four traps are four faces of one sentence:

A cache is a copy that can vanish at any moment, will be faithfully preserved, and occupies finite space — so always think about the instant it expires, disappears, blows up, or stores the wrong thing.

the instant it expires → don't let everyone expire together (stampede)
the instant it vanishes → don't treat it as the source of truth
the matter of space → don't let keys grow without bound
the matter of correctness → don't cache a failed result

That one line of fetch makes caching look easy. What actually marks a senior engineer is anticipating, the moment you add a cache, every way it can go wrong. These four are the ones you only learn in production, and never forget once you do.