Blue Hills

Posted on • Originally published at jwtshield.com

We rotated our JWKS without overlap. Here is the 4-minute window that broke prod.

The on-call alert at 02:14 said auth_5xx_rate had spiked from 0.01% to 31.4%. Not a deploy window. Not a traffic spike. Just thirty-one percent of authenticated requests failing for about four minutes, then back to baseline.

The cause was a JWKS rotation on the issuer side. New keys came in. Old keys went out. Caches in our service didn't refresh fast enough. Tokens signed with the new key were rejected because the verifier still held the old JWKS. Tokens signed with the old key were rejected because the issuer had stopped publishing that key, so any verifier that had already refreshed no longer had it. We had a key-overlap gap of roughly four minutes between when the issuer stopped issuing tokens with the old key and when our verifier's cache picked up the new one.

This is a class of bug that does not show up in any of the tests we run. Unit tests use a fixture JWKS that never rotates. Integration tests use a mocked issuer. Synthetic monitoring hits the live issuer but uses tokens minted within the same minute, so cache freshness is irrelevant. The bug only shows up in the seam between the issuer's rotation cadence and the verifier's cache TTL — a seam that exists only in production.

The mechanic

Identity providers rotate signing keys for two reasons: scheduled rotation (typically 24h to 30 days, depending on policy) and incident rotation (compromise suspected). The standard practice is overlap — publish the new key in the JWKS endpoint several hours before issuing tokens with it, so every consumer's cache has time to refresh before the first token signed with the new key arrives.

The overlap window has to be longer than the longest cache TTL across all consumers. Most JWKS-fetching libraries default to a 10-minute TTL. Some are 1 hour. Some hardcode a 24-hour cache and don't expose a refresh hook at all. If your overlap is shorter than your slowest consumer's TTL, you will see exactly what we saw: a brief window where new tokens are unverifiable because the consumer hasn't picked up the new key yet.
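
The TTL rule above is simple arithmetic, so it can be checked before a rotation rather than after. A minimal sketch, assuming you can list each consumer's cache TTL; the consumer names and the `consumers_at_risk` helper are hypothetical, and the TTL values are only the illustrative ones from the paragraph above:

```python
from datetime import timedelta


def consumers_at_risk(
    overlap: timedelta, cache_ttls: dict[str, timedelta]
) -> dict[str, timedelta]:
    """Map each consumer whose TTL exceeds the overlap to its worst-case stale window.

    Worst case: the consumer refreshed just before the new key was published,
    so its cache stays stale for (ttl - overlap) after new-key tokens start.
    """
    return {
        name: ttl - overlap
        for name, ttl in cache_ttls.items()
        if ttl > overlap
    }


# Hypothetical fleet; names are made up for illustration.
ttls = {
    "library-default": timedelta(minutes=10),
    "hourly-cache": timedelta(hours=1),
    "hardcoded-cache": timedelta(hours=24),
}
at_risk = consumers_at_risk(timedelta(hours=4), ttls)
for name, gap in at_risk.items():
    print(f"{name}: may hold a stale JWKS for up to {gap} after the first new-key token")
```

This is an upper bound. The actual window for a given consumer depends on when its cache last refreshed relative to the rotation.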

Our issuer's overlap was 4 hours. The consumer with the slowest cache was a service we hadn't touched in six months, running an older version of node-jose with a 24-hour TTL. The first token signed with the new key arrived 4 hours and 12 seconds after the rotation announcement. Cache TTL hadn't expired yet. 401s for the rest of the cache window.

The reproduction

Spin up a JWKS server. Sign a token with key A. Verify it. Rotate the JWKS endpoint to key B with no overlap. Sign a new token with key B. Try to verify with the cached JWKS:

ERR: kid 'b1' not found in JWKS
ERR: signature verification failed

That is the exact production error, just on a laptop. The painful part of the production version is that you cannot fix it from inside the verifier — by the time the alert pages, the rotation is already happening, and the only mitigation is to wait for the cache TTL to expire across the fleet.
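
The reproduction steps can also be simulated in-process, without a JWKS server. The sketch below is a toy under stated assumptions: `FakeIssuer` and `CachingVerifier` are hypothetical names, and HMAC secrets stand in for the asymmetric key pairs a real JWKS would publish. Only the kid lookup against a stale cache is meant to be realistic.

```python
import hashlib
import hmac


class FakeIssuer:
    """Stand-in for a JWKS endpoint plus token minting (HMAC replaces real key pairs)."""

    def __init__(self, kid: str, secret: bytes):
        self.keys = {kid: secret}
        self.active_kid = kid

    def rotate(self, kid: str, secret: bytes) -> None:
        # No overlap: the old key disappears the moment the new one appears.
        self.keys = {kid: secret}
        self.active_kid = kid

    def jwks(self) -> dict[str, bytes]:
        return dict(self.keys)

    def sign(self, payload: bytes) -> tuple[str, bytes, bytes]:
        sig = hmac.new(self.keys[self.active_kid], payload, hashlib.sha256).digest()
        return self.active_kid, payload, sig


class CachingVerifier:
    """Fetches the JWKS once and never refreshes, like a cache whose TTL has not expired."""

    def __init__(self, issuer: FakeIssuer):
        self.cached = issuer.jwks()

    def verify(self, token: tuple[str, bytes, bytes]) -> str:
        kid, payload, sig = token
        key = self.cached.get(kid)
        if key is None:
            return f"ERR: kid '{kid}' not found in JWKS"
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        if hmac.compare_digest(sig, expected):
            return "OK"
        return "ERR: signature verification failed"


issuer = FakeIssuer("a1", b"secret-a")
verifier = CachingVerifier(issuer)
before = verifier.verify(issuer.sign(b"sub=alice"))  # signed with key A, cache has A
issuer.rotate("b1", b"secret-b")                     # rotate with no overlap
after = verifier.verify(issuer.sign(b"sub=alice"))   # signed with key B, cache still has A
print(before)
print(after)
```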

What we actually did

After the postmortem, three things changed:

1. Fail loud, not silent. We added explicit logging when the cache misses on a kid lookup, and when JWKS refresh returns a different fingerprint than the cached one. The 4-minute window had been silent in our logs because the JWT library swallowed the cache miss and returned a generic "signature invalid" error. We could not tell from logs alone that this was a rotation problem.

2. Forced refresh on kid miss. Many JWT libraries don't refresh the JWKS when they hit a kid they don't recognize; they fail closed and return an error. We patched our wrapper to force a JWKS refetch on the first kid miss, then retry verification once. This shortens the window from "wait for cache TTL" to "one refetch round-trip", which is sub-second for most clients.

3. We added a CI check. Every deploy now runs a JWKS rotation classifier against our issuer. If the issuer's current JWKS is disjoint from the previous JWKS we recorded — meaning no key in the previous set is in the current set — the CI step fails the build with a structured finding. This is the check that would have caught the issuer-side overlap gap before anyone shipped tokens against it.
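
Items 1 and 2 fit naturally in one wrapper. What follows is a rough sketch of the idea, not our actual patch: `RefreshingKeyStore`, `fingerprint`, and the `fetch_jwks` callable are hypothetical names, and the JWKS is assumed to be a dict with a `keys` list whose entries carry a `kid`.

```python
import hashlib
import json
import logging
from typing import Callable

log = logging.getLogger("jwks")


def fingerprint(jwks: dict) -> str:
    """Stable short hash of a JWKS document, used to log when its contents change."""
    canonical = json.dumps(jwks, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


class RefreshingKeyStore:
    """Caches a JWKS and refetches once when asked for an unknown kid."""

    def __init__(self, fetch_jwks: Callable[[], dict]):
        self._fetch = fetch_jwks
        self._jwks = fetch_jwks()

    def _find(self, kid: str):
        return next((k for k in self._jwks.get("keys", []) if k.get("kid") == kid), None)

    def key_for(self, kid: str) -> dict:
        key = self._find(kid)
        if key is not None:
            return key
        # Fail loud: record the miss instead of surfacing a generic signature error.
        old_fp = fingerprint(self._jwks)
        log.warning("jwks cache miss: kid=%s cached_fp=%s", kid, old_fp)
        fresh = self._fetch()
        new_fp = fingerprint(fresh)
        if new_fp != old_fp:
            log.warning("jwks changed: old_fp=%s new_fp=%s", old_fp, new_fp)
        self._jwks = fresh
        key = self._find(kid)  # retry the lookup exactly once
        if key is None:
            raise KeyError(f"kid {kid!r} not in JWKS even after refetch")
        return key


# Usage with an in-memory stand-in for the issuer's endpoint.
endpoint = {"jwks": {"keys": [{"kid": "a1"}]}}
store = RefreshingKeyStore(lambda: endpoint["jwks"])
endpoint["jwks"] = {"keys": [{"kid": "b1"}]}  # issuer rotates with no overlap
found = store.key_for("b1")                   # miss, refetch, retry succeeds
```

A production version would also rate-limit the refetch, since any caller can send tokens with made-up kids and trigger a fetch per request.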

The check we use is jwtshield's /v1/validate/jwks-rotation endpoint. It accepts a previous JWKS, a current JWKS, an optional sample token, and an optional overlap policy, and returns one of no_change, safe_overlap, overlap, or disjoint. A disjoint result means every key has changed with no overlap window. That is the failure mode.

curl -X POST https://api.jwtshield.com/v1/validate/jwks-rotation \
  -H "Authorization: Bearer $JWTSHIELD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "previous_jwks": <last-known good>,
    "current_jwks":  <freshly fetched>,
    "overlap_policy": { "min_overlap_count": 1 }
  }'

In CI, the redbullhorns/jwtshield-ci@v1 Action wraps this into a short GitHub Actions step:

- uses: redbullhorns/jwtshield-ci@v1
  with:
    issuer: https://login.example.com
    audience: api://backend
    fail-on-severity: high
    api-key: ${{ secrets.JWTSHIELD_API_KEY }}

The step fails the build if the rotation classifier returns disjoint (no overlap), or if the overlap policy you set is violated. If the issuer is well-behaved and overlaps are 24h+, you never see this fire. The day someone forgets to set the overlap correctly, you find out before tokens ship.
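
The core of the disjoint test is set logic on key IDs, which is easy to approximate locally. The sketch below is not jwtshield's implementation: `classify_rotation` is a hypothetical helper, it compares keys by `kid` only (so it would miss a reused kid with different key material), and it does not attempt the safe_overlap result, which depends on the overlap policy.

```python
def classify_rotation(previous_jwks: dict, current_jwks: dict) -> str:
    """Classify a rotation by comparing the kid sets of two JWKS documents."""
    prev = {k["kid"] for k in previous_jwks.get("keys", [])}
    curr = {k["kid"] for k in current_jwks.get("keys", [])}
    if prev == curr:
        return "no_change"
    if prev & curr:
        return "overlap"   # at least one previously published key is still served
    return "disjoint"      # nothing from the previous set survives


old = {"keys": [{"kid": "a1"}]}
during = {"keys": [{"kid": "a1"}, {"kid": "b1"}]}
new = {"keys": [{"kid": "b1"}]}

print(classify_rotation(old, old))
print(classify_rotation(old, during))
print(classify_rotation(old, new))
```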

What we learned

JWKS rotation is one of those bugs that is invisible until it ruins your night. Test fixtures don't replicate the seam. Mocked issuers don't either. The bug lives in the gap between the issuer's rotation policy and your verifier's cache TTL, and it stays invisible until the gap is shorter than the cache.

The postmortem item that made the difference wasn't the cache fix or the logging change. Those were good, but they only shrink the blast radius once the bug fires. The CI check is what stops the bug from firing in the first place.

If you've shipped a JWT validator in the last five years and you have not run jwks-rotation against your issuer's last 30 days of JWKS publications, there is a good chance you have at least one of these gaps in your fleet right now.

Read the full incident-class catalogue: Three JWT bugs that ship to prod silently.

Free tier: 200 verifies per month, all algorithms, no card. Get an API key →

