Akarshan Gandotra

Part 7 — Token Revocation Without Killing Performance

JWTs have a hard problem hiding inside them: they're stateless. The whole point of a JWT is that the verifier can check a signature and make a decision — no database, no round-trip. That's what makes them fast. It's also what makes "log this user out right now" not work out of the box.

We had to solve this. Users log out. Admins disable accounts. Service accounts rotate. Each one of those events has to invalidate live tokens immediately, not at the next expiry tick.

This post is about how we did it without giving up the performance properties that made JWTs worth using in the first place.


The constraints that ruled out the obvious answers

Three numbers shape the design:

  • 50,000 RPS of authenticated requests.
  • Sub-millisecond auth budget on the hot path.
  • Single-digit-second propagation — when a user logs out, every pod must know within a few seconds.

The obvious approaches each break one of these:

Query Redis on every request. Adds a network round-trip to every auth decision; a single Redis hop already blows the sub-millisecond budget. Redis also becomes a hard single point of failure: if it's slow or down, every request fails.

Push revocation events via websockets or long-poll to every pod. Works at low scale. Gets fragile when pods churn, restart, or drop events during a network blip.

Short-lived tokens with fast refresh. A 5-minute expiry reduces the window, but doesn't close it — and 5 minutes is too long when an account is disabled for a security reason.

What worked: a two-layer design.

  • Redis is the propagation layer. It holds the authoritative revocation state and a live event feed.
  • Local memory is the decision layer. Each pod keeps an in-memory map of revoked JTIs (sketched just below). The hot-path check is a single map lookup — no I/O.
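
Concretely, the per-pod state is small. A minimal sketch of it, with illustrative field names, assuming the go-redis client:

import (
    "sync"

    "github.com/redis/go-redis/v9"
)

type TokenRevocationService struct {
    redis      *redis.Client    // propagation layer: the ZSET and the Stream
    mu         sync.RWMutex     // guards localCache
    localCache map[string]int64 // decision layer: JTI -> expiry (unix seconds)
}

The method snippets below hang off this struct and additionally use context, time, and strconv from the standard library.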


Two Redis structures, one job each

Two Redis keys do the heavy lifting, and they serve different purposes — which is why both are necessary.

revoked_access_tokens is a sorted set. Each member is a JTI; the score is the token's expiry timestamp. This is the source of truth at any point in time — you can ask it "give me everything currently revoked" with a single range query.

revoked_access_token_events is a stream. Each entry carries the JTI, expiry, and metadata about the revocation. This is the live feed — pods subscribe to it and learn about new revocations as they happen.

The ZSET answers "what is the state right now?" The Stream answers "what has changed since I last checked?" You need both because they're good at different things: the ZSET is for bulk reads at startup, the Stream is for incremental updates during steady state.
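
The write path isn't covered in this post, but to make the two structures concrete, publishing a revocation amounts to one ZADD plus one XADD. A sketch, not our production code; the reason field is illustrative:

// Publish one revocation to both structures in a single pipeline,
// so the state (ZSET) and the feed (Stream) stay in step.
func publishRevocation(ctx context.Context, rdb *redis.Client,
    jti string, expiresAt int64, reason string) error {
    pipe := rdb.TxPipeline()
    // State: member = JTI, score = expiry timestamp.
    pipe.ZAdd(ctx, "revoked_access_tokens",
        redis.Z{Score: float64(expiresAt), Member: jti})
    // Feed: the event pods tail for incremental updates.
    pipe.XAdd(ctx, &redis.XAddArgs{
        Stream: "revoked_access_token_events",
        Values: map[string]interface{}{
            "jti":        jti,
            "expires_at": expiresAt,
            "reason":     reason,
        },
    })
    _, err := pipe.Exec(ctx)
    return err
}

Storing the expiry as the score has a second payoff: expired members can be swept out of the ZSET with a single ZREMRANGEBYSCORE.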


The startup problem — and two races hiding in the obvious solution

When a pod boots, it needs to populate its local map before it serves traffic. The tempting approach: read the ZSET to get current revocations, then subscribe to the Stream for updates.

Two races hide here.

Race 1: What if a revocation arrives between the ZSET read and the Stream subscription? The event is in the Stream, but the pod's cursor is positioned after it. The JTI never makes it into the local cache.

Race 2: What if you start the Stream consumer from the very beginning (0-0) to avoid missing anything? Now you replay every event ever emitted — potentially thousands. Worse: if the stream has been trimmed, you'll silently miss events older than the trim window.

The fix is to reverse the order: capture the Stream tip before reading the ZSET, then start the consumer from that captured tip.

func (s *TokenRevocationService) WarmCache(ctx context.Context) error {
    // 1. Capture the stream tip first.
    tipID, err := s.captureStreamTip(ctx)
    if err != nil {
        return err
    }

    // 2. Read the current ZSET snapshot, skipping already-expired entries.
    now := time.Now().Unix()
    members, err := s.redis.ZRangeByScoreWithScores(ctx,
        "revoked_access_tokens",
        &redis.ZRangeBy{Min: strconv.FormatInt(now, 10), Max: "+inf"},
    ).Result()
    if err != nil {
        return err
    }

    s.mu.Lock()
    for _, m := range members {
        s.localCache[m.Member.(string)] = int64(m.Score)
    }
    s.mu.Unlock()

    // 3. Start the consumer at the tip captured before the ZSET read.
    // Anything that arrived between tipID and now replays through the consumer.
    go s.consumeStream(tipID)
    return nil
}

The ordering is what makes this correct. By capturing the tip first, anything that arrives while we're reading the ZSET will replay through the consumer. Anything already in the ZSET when we read it is loaded directly. If the same JTI appears in both (a revocation that landed right on the boundary), setting the same map entry twice is harmless.
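
Neither captureStreamTip nor consumeStream is shown above. Here's one shape they could take; a sketch under assumptions (go-redis, no consumer groups, the JTI and expiry stored as stream fields named jti and expires_at), not a transcript of our code:

// Read the ID of the newest entry in the stream. If the stream is empty,
// return "0-0" so the consumer replays anything published afterward.
func (s *TokenRevocationService) captureStreamTip(ctx context.Context) (string, error) {
    entries, err := s.redis.XRevRangeN(ctx,
        "revoked_access_token_events", "+", "-", 1).Result()
    if err != nil {
        return "", err
    }
    if len(entries) == 0 {
        return "0-0", nil
    }
    return entries[0].ID, nil
}

// Tail the stream from the captured tip, folding each event into the
// local cache and advancing the cursor as we go.
func (s *TokenRevocationService) consumeStream(lastID string) {
    ctx := context.Background()
    for {
        streams, err := s.redis.XRead(ctx, &redis.XReadArgs{
            Streams: []string{"revoked_access_token_events", lastID},
            Count:   100,
            Block:   5 * time.Second,
        }).Result()
        if err == redis.Nil {
            continue // block timed out with no new events
        }
        if err != nil {
            time.Sleep(time.Second) // transient error: back off, retry
            continue
        }
        for _, msg := range streams[0].Messages {
            jti, _ := msg.Values["jti"].(string)
            expStr, _ := msg.Values["expires_at"].(string)
            exp, _ := strconv.ParseInt(expStr, 10, 64)
            s.mu.Lock()
            s.localCache[jti] = exp
            s.mu.Unlock()
            lastID = msg.ID // advance the cursor past this event
        }
    }
}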

The pod's lifecycle from boot to steady state, in order: capture the Stream tip, load the ZSET snapshot into the local map, start the consumer from the captured tip, then begin serving traffic.


The hot path: deliberately boring

The actual check on every request is about as simple as it gets:

func (s *TokenRevocationService) IsJTIRevoked(jti string) bool {
    s.mu.RLock()
    expiresAt, found := s.localCache[jti]
    s.mu.RUnlock()
    if !found {
        return false
    }
    if time.Now().Unix() > expiresAt {
        return false // the token has expired on its own; no longer worth tracking
    }
    return true
}

A read lock, a map lookup, a comparison. No Redis, no network. Hundreds of nanoseconds.

The !found → false branch is a deliberate fail-open choice: if a JTI isn't in the local cache, we treat it as not revoked. The risk is that a freshly revoked token might be accepted for the few seconds between the revocation being published and the local cache being updated. We accept that window. The alternative — failing closed — would mean denying every request whose JTI we haven't explicitly loaded, which at startup means denying all traffic until the cache is fully warm. That's worse.


The gap probe: catching what the Stream misses

The Stream consumer keeps a cursor — the ID of the last event it processed. Periodically, the stream gets trimmed to bound its size. If the consumer's cursor falls behind the trim window (because of a slow handler, a GC pause, or a network blip), the next XREAD will silently skip the trimmed events.

We detect this with a gap probe that runs every 5 minutes:

If the oldest event currently in the Stream is newer than the consumer's cursor, we missed something. When that happens, we resync from the ZSET (which is the authoritative source of truth and doesn't get trimmed the same way) and snap the cursor to the stream tip.
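
In code, the probe is a handful of lines. This sketch assumes a streamIDBefore helper that compares two stream IDs numerically and a cursor() accessor exposing the consumer's last-processed ID; neither is shown here:

// Detect a trim-induced gap and recover by re-warming from the ZSET.
func (s *TokenRevocationService) probeForGap(ctx context.Context) error {
    oldest, err := s.redis.XRangeN(ctx,
        "revoked_access_token_events", "-", "+", 1).Result()
    if err != nil {
        return err
    }
    if len(oldest) == 0 {
        return nil // empty stream: nothing can have been trimmed past us
    }
    // Cursor older than the oldest surviving entry means the entries
    // between them were trimmed away before we read them.
    if streamIDBefore(s.cursor(), oldest[0].ID) {
        // WarmCache re-reads the ZSET and snaps the cursor to the tip.
        return s.WarmCache(ctx)
    }
    return nil
}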

This probe has fired exactly twice in production since we added it — both times during planned Redis maintenance — and both times the recovery was automatic. The value isn't that it fires often. It's that without it, you'd never know you missed events at all.


Service accounts: the same idea, different risk tolerance

User token revocation fails open — a freshly revoked token might slip through for a few seconds. That's acceptable: the window is small, bounded, and observable.

Service-account rotation fails closed. When a service account is rotated, the old credentials must be denied immediately, even if that means a slightly degraded startup path.

The mechanism is different too: instead of JTI revocation, service accounts carry a version number. The gateway keeps a local map of current SA versions loaded from Redis. If the token's version is less than the current version for that service account, it's denied.
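
The check itself is as boring as the JTI lookup, with the polarity flipped. A sketch; saVersions and saMu are illustrative names for a map and lock kept alongside the JTI cache:

// Fail-closed version check for service-account tokens.
func (s *TokenRevocationService) IsSAVersionCurrent(saID string, tokenVersion int64) bool {
    s.saMu.RLock()
    current, found := s.saVersions[saID] // map[string]int64, synced from Redis
    s.saMu.RUnlock()
    if !found {
        return false // fail closed: an unknown service account is denied
    }
    return tokenVersion >= current
}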

The pod won't pass readiness until this cache is loaded. If Redis is unavailable at startup, the pod doesn't serve traffic. That's intentional — we'd rather have fewer pods than pods that can't correctly enforce SA rotation.

The version map refreshes from Redis on a 60-second sync loop, and that window is our exposure. We reduce the effective risk by having the rotating system hold the old version live for a grace period, only promoting the new version once enough gateways have synced.


What revocation doesn't do

A few things that seem natural but aren't in scope:

Revoke by user ID. The cache is JTI-indexed. To revoke all of a user's tokens, the issuer enumerates their live JTIs and revokes each one. The Auth Service sees only individual JTIs.

Cross-region propagation. We run regional auth services with regional Redis instances. Revocations published in one region don't automatically appear in another. Most revocations are tenant-bound, and tenants are region-bound, so this rarely matters in practice.

Shared Redis. This Redis instance is auth-only. The corner cases in revocation are complex enough that sharing infrastructure with rate limiters or session stores would make debugging much harder.


What we'd do differently on day one

Add the gap probe immediately. It's a small amount of code and it's the difference between "we silently lose a logout event occasionally" and "we always know when propagation breaks."

Test the warm path with a slow or unavailable Redis. Most bugs we found were in error handling during startup, not steady-state operation. The warm path runs once per pod lifetime; staging rarely exercises it unless you deliberately inject failures.

Bound everything from the start. The local cache, the stream length, the sync interval. Unbounded growth in any of them becomes an incident.
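
For the stream specifically, bounding is a single call. Something like this, run periodically or piggybacked on the publish path; the cap of 100,000 is an arbitrary illustration:

// Cap the event stream so it can't grow without bound. Approximate ("~")
// trimming is cheaper for Redis than an exact MAXLEN.
func trimRevocationStream(ctx context.Context, rdb *redis.Client) error {
    return rdb.XTrimMaxLenApprox(ctx, "revoked_access_token_events", 100000, 0).Err()
}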


Next up, Part 8: every cache in the hot path, together. JWT verify cache, RSA key cache, route cache, policy bitmap, revocation map, SA version map. Each one is fast individually; together they're how the gateway fits inside its latency budget. We'll cover TTL strategy, invalidation, and the one cache where we got eviction wrong.
