
Akarshan Gandotra


Part 8 — Making It Fast: Caching, Hot Paths, and Avoiding DB Calls

The Auth Gateway sits in front of every authenticated request in the platform. Its latency isn't just its own latency — it's the floor for every service behind it. If auth takes 50ms, every request to every upstream service starts 50ms in the hole.

Our internal target is sub-millisecond on cache-hot paths. The way we hit it isn't clever algorithms — it's a stack of small caches, each one handling a different kind of state, each invalidated through a different channel. This post walks through all of them.


The principle that shapes everything

Before the individual layers: a rule we hold as policy.

Redis is allowed to influence the hot path. Redis is not allowed to block it.

Every cache in the system is in-process. Redis feeds them asynchronously — pushing revocation events, triggering trie reloads, syncing SA versions. But a pod whose Redis connection is dead can still answer requests correctly, for the duration of its staleness window.

That's the difference between "Redis is down, the platform is down" and "Redis is down, the platform is slightly stale." One is a severity-1 incident. The other is a degraded mode we can tolerate for minutes while someone fixes it.

With that framing, here's how a warm request flows through the cache stack:

Six layers. Five are pure in-process memory. The sixth — revocation — is in-process too, but fed asynchronously from Redis. No layer blocks on a network call.


Layer 1: JWT verify cache

The single biggest win in the stack. RSA signature verification is expensive — a few hundred microseconds per call — and at 50,000 RPS that cost is real.

We wrap the entire decode-and-verify path in a Ristretto cache. The key is a 64-bit FNV hash of the raw token string; the value is the decoded JWT claims. On a cache hit, we skip RSA verification entirely.

A few choices worth explaining:

Why Ristretto over a plain LRU. Ristretto uses TinyLFU — it tracks access frequency and uses it to decide what to evict. Under burst traffic, a pure LRU can evict frequently-used tokens just because they weren't the most recent. TinyLFU keeps the hot tokens and evicts the cold ones. The behavior under load is meaningfully better.

Why hash the token string. Two reasons. Memory: a JWT is 500–2000 bytes; a uint64 is 8. And defense-in-depth: if the cache state ever ends up in a log or heap dump, the tokens themselves aren't exposed.

Why cap TTL at 30 seconds. The cache stores the decoded token, not the auth decision. Revocation is checked separately on every request. But capping TTL at 30 seconds keeps the staleness window honest — a token that's been revoked won't ride a warm cache entry for an hour.
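
Stripped down, the layer looks roughly like this. It's a sketch assuming the dgraph-io/ristretto and golang-jwt/jwt libraries; the real gateway's claim types, sizing, and error handling differ.

```go
package authcache

import (
	"crypto/rsa"
	"hash/fnv"
	"time"

	"github.com/dgraph-io/ristretto"
	"github.com/golang-jwt/jwt/v5"
)

// verifyCache maps fnv64(rawToken) -> decoded claims. On a hit we skip RSA
// verification entirely; the revocation check still runs on every request.
var verifyCache, _ = ristretto.NewCache(&ristretto.Config{
	NumCounters: 1_000_000, // keys tracked for frequency (~10x max entries)
	MaxCost:     100_000,   // hard cap on entries (each Set uses cost 1)
	BufferItems: 64,
	Metrics:     true,
})

const verifyTTL = 30 * time.Second // bounds staleness; revocation is checked separately

// tokenKey hashes the raw token so the token string itself is never stored.
func tokenKey(raw string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(raw))
	return h.Sum64()
}

// VerifyJWT returns cached claims when possible, otherwise does the full
// RSA parse-and-verify and caches the result for at most 30 seconds.
func VerifyJWT(raw string, key *rsa.PublicKey) (jwt.MapClaims, error) {
	k := tokenKey(raw)
	if v, ok := verifyCache.Get(k); ok {
		return v.(jwt.MapClaims), nil
	}

	tok, err := jwt.ParseWithClaims(raw, jwt.MapClaims{},
		func(*jwt.Token) (interface{}, error) { return key, nil }, // per-tenant key (Layer 2)
		jwt.WithValidMethods([]string{"RS256"}))
	if err != nil {
		return nil, err // failures aren't cached; a retry pays the RSA cost again
	}

	claims := tok.Claims.(jwt.MapClaims)
	verifyCache.SetWithTTL(k, claims, 1, verifyTTL)
	return claims, nil
}
```

Every entry is stored with cost 1, so MaxCost acts as a simple entry cap rather than a byte budget.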


Layer 2: RSA public key cache

Per-tenant RSA public keys are loaded from environment config at boot. Parsing PEM is not free — a few hundred microseconds — and we don't want to pay it on every cache miss.

We cache the parsed key per tenant using sync.Once. The first request for a given tenant parses the key; every request after that gets the cached result, even if that first parse failed (the error is cached too).

Two operational details that matter:

A misconfiguration fires a Slack alert once per tenant per pod, not once per request. Without this guard, a single bad key config generates a Slack message for every request that hits that tenant, which during a deploy is thousands of messages in seconds.

Key rotation requires a pod restart. We considered hot-reloading. We chose deploy-to-rotate — the operational simplicity of a predictable restart beats the complexity of a file watcher and the failure modes it introduces.
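
Sketched below with illustrative names; alertOnce stands in for the real Slack notifier, and the pattern caches a failed parse as readily as a successful one.

```go
package authcache

import (
	"crypto/rsa"
	"sync"

	"github.com/golang-jwt/jwt/v5"
)

// keyEntry caches the outcome of parsing one tenant's PEM key, success or failure.
type keyEntry struct {
	once sync.Once
	key  *rsa.PublicKey
	err  error
}

var tenantKeys sync.Map // tenant slug -> *keyEntry

// publicKeyFor parses a tenant's PEM-encoded public key exactly once per pod.
// A bad key is parsed (and alerted on) once; every later request for that
// tenant gets the cached error back immediately.
func publicKeyFor(tenant, pemStr string) (*rsa.PublicKey, error) {
	v, _ := tenantKeys.LoadOrStore(tenant, &keyEntry{})
	e := v.(*keyEntry)
	e.once.Do(func() {
		e.key, e.err = jwt.ParseRSAPublicKeyFromPEM([]byte(pemStr))
		if e.err != nil {
			alertOnce(tenant, e.err) // runs at most once per tenant per pod
		}
	})
	return e.key, e.err
}

// alertOnce stands in for the real Slack notifier; stubbed here.
func alertOnce(tenant string, err error) {}
```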


Layer 3: route cache

The trie lookup is already fast — O(depth), with depth typically 3–5 segments. But re-walking the same paths 50,000 times a second is wasteful. A TinyLFU cache sits in front of the trie, keyed by slug, HTTP method, and path.

The platform has around 3,000 distinct route tuples in production. Sized at 10,000 entries, the cache fits the entire steady-state working set with room to spare. Misses are new endpoints, cold starts, and post-reload warm-up.

Invalidation is bulk. On any trie reload — whether triggered by a periodic interval or a Redis Pub/Sub kick — we drop the entire route cache. We considered partial invalidation (only drop entries for changed slugs) and rejected it. Trie reloads are rare. The cache refills in milliseconds. The bookkeeping complexity of partial invalidation isn't worth the seconds of warm-up time it would save.
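
A sketch of the lookup path and the bulk invalidation, with the trie walk itself passed in as a callback; the composite key and the Clear() on reload are the parts that matter here.

```go
package routecache

import "github.com/dgraph-io/ristretto"

// RouteMatch is whatever the trie walk returns (elided here).
type RouteMatch struct {
	EndpointID string
	Params     map[string]string
}

var routeCache, _ = ristretto.NewCache(&ristretto.Config{
	NumCounters: 100_000, // ~10x the expected working set
	MaxCost:     10_000,  // steady state is ~3,000 distinct route tuples
	BufferItems: 64,
	Metrics:     true,
})

// routeKey builds an unambiguous composite key from the three lookup fields.
func routeKey(slug, method, path string) string {
	return slug + "\x00" + method + "\x00" + path
}

// Lookup consults the cache first and falls back to the trie walk on a miss.
func Lookup(slug, method, path string, walkTrie func() (RouteMatch, bool)) (RouteMatch, bool) {
	k := routeKey(slug, method, path)
	if v, ok := routeCache.Get(k); ok {
		return v.(RouteMatch), true
	}
	m, ok := walkTrie()
	if ok {
		routeCache.Set(k, m, 1)
	}
	return m, ok
}

// OnTrieReload drops the whole route cache; it refills in milliseconds.
func OnTrieReload() {
	routeCache.Clear()
}
```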


Layer 4: the trie

The trie is a cache too, just an unusual one. It's an in-memory mirror of the endpoint table from Postgres. No request ever touches Postgres on the hot path.

Invalidation has two channels:

  • Periodic: every hour by default. A safety net.
  • Push: via Redis Pub/Sub on auth:trie:refresh. Admin tooling publishes this after any write to the endpoint table. Pods reload within milliseconds.

The push channel exists because endpoint changes are operationally significant. A new admin route that's meant to be protected shouldn't have a one-hour window where it's open because the trie hasn't refreshed. The push channel closes that window.
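
The subscriber side, sketched with go-redis; the Postgres load and atomic trie swap are hidden behind reload(), and the hourly ticker doubles as the safety net when Pub/Sub is unavailable.

```go
package trie

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// watchTrieRefresh reloads the trie on push events and on a periodic safety-net
// interval. reload() is the existing Postgres load plus atomic root swap.
func watchTrieRefresh(ctx context.Context, rdb *redis.Client, reload func()) {
	sub := rdb.Subscribe(ctx, "auth:trie:refresh")
	defer sub.Close()

	ticker := time.NewTicker(time.Hour) // periodic safety net
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			reload()
		case _, ok := <-sub.Channel():
			if !ok {
				return // Pub/Sub gone; the hourly ticker still bounds staleness
			}
			reload()
		}
	}
}
```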


Layer 5: policy bitmap snapshot

The permission bitmap (covered in the previous chapter) is loaded alongside the trie. It's an in-memory structure mapping permission names to bit indexes, with a version number.

The snapshot is never partially updated. It's swapped atomically — a background process builds a new snapshot when the registry changes, then stores it via an atomic pointer swap. Readers grab the pointer at the start of a request and work with that exact snapshot throughout. No locks, no torn reads.

This pattern shows up repeatedly in the codebase: when state changes as a whole unit, an atomic pointer is simpler and faster than a read-write mutex around a map. It's worth internalizing.
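
The pattern in miniature, using the typed atomic.Pointer from the standard library (Go 1.19+); the field names here are illustrative.

```go
package policy

import "sync/atomic"

// Snapshot is the immutable permission-bitmap view: name -> bit index, plus a version.
type Snapshot struct {
	Version int64
	BitFor  map[string]uint32
}

var current atomic.Pointer[Snapshot]

// Publish atomically swaps in a fully built snapshot. Readers never see a
// partially updated map because the old snapshot is never mutated.
func Publish(s *Snapshot) {
	current.Store(s)
}

// Load is called once at the start of a request; the handler works with
// that exact snapshot for its whole lifetime.
func Load() *Snapshot {
	return current.Load()
}
```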


Layer 6: revocation map and SA version map

These were covered in depth in the previous chapter. In the context of the cache stack:

The revocation map is bounded at 50,000 JTIs, fed by a Redis Stream, and fails open — if a JTI isn't in the map, we treat it as not revoked. The staleness window is low single-digit seconds in steady state.

The SA version map has the opposite posture: fail closed. If the map isn't ready, the pod doesn't pass readiness. If a service account token's version is behind the current version in the map, it's denied.

Same underlying shape — in-memory map fed asynchronously from Redis — but different risk tolerance based on what's being protected.
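
The two postures side by side, sketched without the Redis-fed sync that populates the maps; the names are illustrative.

```go
package revocation

import "sync"

// Store holds both in-memory maps; a background consumer fed by Redis keeps
// them current (not shown).
type Store struct {
	mu         sync.RWMutex
	revokedJTI map[string]struct{} // bounded at 50k entries, fed by a Redis Stream
	saVersion  map[string]int64    // service account -> current token version
}

// IsRevoked fails open: an unknown JTI is treated as not revoked. Worst case
// is a few seconds of staleness, never a blocked request.
func (s *Store) IsRevoked(jti string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, revoked := s.revokedJTI[jti]
	return revoked
}

// AllowSAVersion fails closed: an unknown account or a stale token version is
// denied. (Pod readiness already guarantees the map has been synced once.)
func (s *Store) AllowSAVersion(account string, tokenVersion int64) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	cur, ok := s.saVersion[account]
	if !ok {
		return false
	}
	return tokenVersion >= cur
}
```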


How all the invalidation channels fit together

Three patterns across the stack:

TTL-based (JWT verify cache). Simple, no coordination. Best when the cached value has a natural expiry built into it — which JWTs do.

Push-based (trie, revocation stream, SA version). Required when a staleness window has real cost. Needs a degraded-mode plan for when the push channel is unavailable.

Capacity-based eviction (route cache, JWT cache). Bounded memory by design. What gets evicted matters more than when — which is why TinyLFU beats LRU for this workload.

When in doubt, start with TTL. Push-based caches are powerful but bring failure modes — lost events, stalled consumers, cursor races. Use them only when a TTL window is genuinely unacceptable.


The cache we got wrong

Our first JWT cache used a plain Go map with a mutex and a time.AfterFunc per entry to handle expiry.
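
Reconstructed for illustration, it looked roughly like this:

```go
package naive

import (
	"sync"
	"time"
)

// naiveCache is the original design: a plain map, a mutex, and a timer per entry.
type naiveCache struct {
	mu sync.Mutex
	m  map[uint64]any
}

func (c *naiveCache) set(key uint64, value any, ttl time.Duration) {
	c.mu.Lock()
	c.m[key] = value
	c.mu.Unlock()

	// One runtime timer per cached token, a goroutine per expiry, and no size cap anywhere.
	time.AfterFunc(ttl, func() {
		c.mu.Lock()
		delete(c.m, key)
		c.mu.Unlock()
	})
}
```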

It worked in tests. It fell over in production within a week. Two problems:

Goroutine pressure. Every cached token spawned a timer goroutine. At a million live tokens, the Go scheduler handled it — but GC pauses got ugly and unpredictable.

No cap. There was no size limit. Memory grew until pods OOM-killed.

Switching to Ristretto solved both: timers are amortized into a small internal worker, and MaxCost enforces a hard ceiling.

The lesson: a cache is a copy of state. If there's no mechanism to bound or invalidate it — TTL, push, or capacity — it's not a cache. It's a memory leak.


Cold start vs. warm

A pod's first requests are slower. The trie loads from Postgres before readiness flips; after that, Postgres is touched only by background trie reloads, never by a request. Every lookup on the hot path is in-memory.

The JWT cache starts empty on a fresh deploy and fills up within seconds as real tokens come through. We don't pre-warm it — the cost of cold RSA verifications for a few seconds after a deploy is acceptable.

The revocation cache we do pre-warm, synchronously, before readiness. A pod that's marked ready must have the current revocation set. Otherwise it would fail-open on every request until its first Redis sync — meaning any logouts from the past hour would be invisible to it.
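
A sketch of how that gating can look, with the initial Redis sync stubbed behind a callback; the handler and variable names are illustrative.

```go
package gateway

import (
	"net/http"
	"sync/atomic"
)

var revocationReady atomic.Bool

// warmRevocation blocks at startup, before the pod reports ready, until the
// revocation set has been loaded from Redis once. Staying not-ready is safer
// than failing open with an empty map and missing the last hour of logouts.
func warmRevocation(syncFromRedis func() error) error {
	if err := syncFromRedis(); err != nil {
		return err
	}
	revocationReady.Store(true)
	return nil
}

// readyz only passes once the revocation set is loaded (the trie load gates
// readiness the same way).
func readyz(w http.ResponseWriter, r *http.Request) {
	if !revocationReady.Load() {
		http.Error(w, "revocation set not loaded", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```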


What to actually graph

For each cache, the metrics that matter:

  • Hit rate — the most important number. A cache with a stable size but falling hit rate is broken.
  • Eviction rate — meaningful only if the cache is bounded. High eviction with high hit rate is fine; it means the cache is doing its job under pressure.
  • Size — useful for capacity planning, not for alerting.

The JWT verify cache runs at 95%+ hit rate in steady state. A fresh deploy drops it to zero and it climbs back within seconds. Anything else warrants investigation.

Don't alert on cache size. Alert on hit rate.
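
One way to wire those numbers up, assuming caches built with Metrics: true and the Prometheus client library; the metric names are illustrative.

```go
package obs

import (
	"github.com/dgraph-io/ristretto"
	"github.com/prometheus/client_golang/prometheus"
)

// registerCacheMetrics exposes hit ratio, evictions, and an approximate size
// for one in-process cache.
func registerCacheMetrics(name string, c *ristretto.Cache, reg prometheus.Registerer) {
	labels := prometheus.Labels{"cache": name}

	reg.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "authgw_cache_hit_ratio", Help: "Hits / (hits + misses).", ConstLabels: labels,
	}, func() float64 { return c.Metrics.Ratio() }))

	reg.MustRegister(prometheus.NewCounterFunc(prometheus.CounterOpts{
		Name: "authgw_cache_evictions_total", Help: "Keys evicted under capacity pressure.", ConstLabels: labels,
	}, func() float64 { return float64(c.Metrics.KeysEvicted()) }))

	reg.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "authgw_cache_entries", Help: "Approximate live entries (added minus evicted).", ConstLabels: labels,
	}, func() float64 { return float64(c.Metrics.KeysAdded() - c.Metrics.KeysEvicted()) }))
}
```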


Next up: Chapter 9 — operating the gateway. The structured auth decision log, OpenTelemetry tracing, the three Kubernetes probes, degraded-mode behavior, and the Slack alert pattern that keeps on-call sane during a Redis outage.
