I pinned each issuer's public key. Then the IdP rotated it.

#ai #mcp #oauth #security

Last time I wrote about keeping the human provable when an agent's delegation chain crosses from one company's identity provider into another's. The verifier walks the chain backward and checks each segment against the key of the issuer that signed it. I ended that post with a line I meant as honesty about the one assumption left standing:

You pick your trusted issuers once, explicitly, in an object you can read.

That sentence quietly skipped a question. The keys have to get into that object somehow. I had them going in as static PEM strings, pinned by hand. It works in a demo. It breaks the first time a real identity provider does the most ordinary thing an identity provider does.

Pinning is fine until it isn't

Here is what the trust set looked like. A little JSON manifest, issuer to public key, loaded off disk:

{
  "https://idp-a.local": "-----BEGIN PUBLIC KEY-----\nMIIBIjANBg...",
  "https://idp-b.local": "-----BEGIN PUBLIC KEY-----\nMIIBIjANBg..."
}

The verifier reads that, and now it can check any token either issuer signed. Clean. Operator-independent, too, which is the whole point of Crumb: those keys came from you, out of band, not from whoever is holding the log you're auditing. Nobody in the middle gets to assert their own trustworthiness.

The problem is the word "static." A signing key is not a fact about an issuer. It's a thing an issuer rotates as basic hygiene, the same way you rotate any other secret. Okta rotates on a schedule by default. Keycloak and Auth0 rotate too, on whatever schedule their operators set. Either way, once it happens, the PEM you pinned three weeks ago is a key nobody signs with anymore.

And it fails in the worst way, which is quietly. Nothing is broken at the moment you deploy. Weeks later the issuer rotates, tokens start arriving signed by a key your manifest has never heard of, and every one of them fails signature verification. Not because anything is forged. Because your copy of reality went stale and nobody told it. The fix is a human noticing, editing a file, and redeploying. That is not a verification system. That is a verification system with a standing appointment to break.

Fetch the key, don't freeze it

The standard already solved this, and I was just not using the part that mattered. An OIDC issuer publishes its current signing keys at a JWKS endpoint, and it tells you where that endpoint is in its discovery document. Every token names, in its header, the exact key it was signed with, by kid.

So the verifier stops pinning keys and starts pinning issuers. You name the issuer you accept. The verifier reads the issuer's /.well-known/openid-configuration, finds its jwks_uri, fetches the keys there, and picks the one whose kid matches the token in hand. A trust-set entry either points at the issuer for discovery or names the JWKS URL directly:

{
  "https://idp-a.local": { "discovery": "https://idp-a.local" },
  "https://idp-b.local": "https://idp-b.local/jwks"
}
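As a rough sketch of that lookup, and not Crumb's actual code: the function names (token_kid, resolve_jwks_uri, select_key) and the injected fetch_json callable are made up for illustration. It handles both entry shapes in the manifest, a discovery pointer and a direct JWKS URL, and leaves the signature check itself out.

```python
import base64
import json
from typing import Callable, Optional, Union

# Hypothetical type: any callable that GETs a URL and returns parsed JSON.
FetchJson = Callable[[str], dict]


def token_kid(token: str) -> str:
    """Read the kid from a JWT's header segment without verifying anything."""
    header_b64 = token.split(".")[0]
    # base64url in JWTs has its padding stripped; put it back before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    header = json.loads(base64.urlsafe_b64decode(padded))
    return header["kid"]


def resolve_jwks_uri(entry: Union[str, dict], fetch_json: FetchJson) -> str:
    """Turn one trust-set entry into the URL where the issuer's keys live."""
    if isinstance(entry, str):
        return entry  # the entry already names the JWKS URL directly
    base = entry["discovery"].rstrip("/")
    config = fetch_json(base + "/.well-known/openid-configuration")
    return config["jwks_uri"]


def select_key(jwks: dict, kid: str) -> Optional[dict]:
    """Return the JWK whose kid matches, or None if the issuer publishes no such key."""
    for key in jwks.get("keys", []):
        if key.get("kid") == kid:
            return key
    return None
```

Passing fetch_json in, rather than calling an HTTP client inside the functions, keeps the lookup logic testable without a network and keeps the TLS decision in one place.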

Rotation stops being an event the verifier has to be told about. A token shows up signed by a kid the verifier hasn't cached, and instead of failing, the verifier refetches the JWKS exactly once, finds the new key sitting right there where the issuer just published it, and verifies. No redeploy. No file edit. The issuer rotated and the verifier followed, because the verifier was reading from the issuer the whole time instead of from a photograph of it.

The one refetch matters, by the way. You cache, or every token turns into a network round trip. But you can't cache so hard that a rotation locks you out. Unknown kid is the signal to look again, once, before giving up. Seen kid comes straight from cache.
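Here is a minimal sketch of that cache rule, under my own naming: RefetchOnceKeySource and its fetch_jwks callable are hypothetical, and fetch_count exists only so the behavior is observable. A seen kid is served from memory; an unseen kid costs one fetch and then either resolves or doesn't.

```python
from typing import Callable, Dict, Optional


class RefetchOnceKeySource:
    """Caches one issuer's keys by kid; an unseen kid triggers a single refetch."""

    def __init__(self, fetch_jwks: Callable[[], dict]):
        self._fetch_jwks = fetch_jwks  # hypothetical: returns the issuer's parsed JWKS
        self._keys: Dict[str, dict] = {}
        self.fetch_count = 0  # exposed only to make the fetch behavior visible

    def _refresh(self) -> None:
        self.fetch_count += 1
        jwks = self._fetch_jwks()
        # Replace the cache wholesale with whatever the issuer publishes right now.
        self._keys = {k["kid"]: k for k in jwks.get("keys", []) if "kid" in k}

    def get(self, kid: str) -> Optional[dict]:
        if kid in self._keys:
            return self._keys[kid]  # seen kid: straight from cache, no network
        self._refresh()  # unseen kid: look again, exactly once
        return self._keys.get(kid)  # still missing: the issuer does not publish it
```

One thing this sketch does not do is limit how often an unseen kid can force a fetch. Tokens carrying made-up kids would each cost a round trip, so a hardened version would likely want a minimum interval between refetches.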

Fetching is not trusting

Now the part I had to be careful about, because it is exactly where this could quietly betray the whole premise.

Crumb exists to let someone verify who directed an action without trusting the operator who holds the log. If I make the verifier go fetch keys over the network, the obvious question is: fetch them from where? Get that wrong and you've handed the trust decision to whoever answers the request.

The keys come from the issuer's own endpoint, over TLS, which is what authenticates that you're really talking to idp-a.local and not someone wearing its name. They never come from the server under audit. That server is the one thing in the picture you have assumed might be lying to you. It holds the ledger it wants believed. It does not get to also supply the keys that would prove the ledger honest. The two roles stay split.

And the verifier still decides which issuers count. Fetching a key from an endpoint is not the same as trusting it. An issuer you never named has a JWKS endpoint too. That buys it nothing. The verifier only reaches for keys at issuers it already put in its own trust set, out of band, the same explicit decision as before. All that changed is that the trusted thing is now an issuer identity instead of one frozen key, and TLS carries the weight of "is this really that issuer."

So the two ways it can refuse stay distinct, and I kept them named:

UntrustedIssuer. The token's issuer is not in the trust set at all. There is no endpoint to even ask. Refused outright.
UnknownSigningKey. The issuer is trusted, but none of its published keys match the token's kid, even after a fresh fetch. The issuer is real; the key the token names is not one it publishes. Refused, not guessed.

The first is a stranger. The second is a trusted party holding up a key that isn't theirs. Collapsing those two into one "nope" would throw away the only information a debugger actually wants.
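To make the split concrete, here is a sketch of how the two refusals could be kept apart. The exception names come from the list above; resolve_key and the shape of trust_set (issuer name mapped to a per-issuer lookup callable) are assumptions of mine, not Crumb's real interface.

```python
from typing import Callable, Dict, Optional


class UntrustedIssuer(Exception):
    """The token's issuer is not in the trust set; nothing is fetched."""


class UnknownSigningKey(Exception):
    """The issuer is trusted but publishes no key with the token's kid."""


def resolve_key(
    issuer: str,
    kid: str,
    trust_set: Dict[str, Callable[[str], Optional[dict]]],
) -> dict:
    """Return the key for (issuer, kid), or raise one of the two named refusals.

    trust_set maps an issuer name to a hypothetical lookup callable that takes
    a kid and returns the matching JWK, or None after its own fresh fetch.
    """
    lookup = trust_set.get(issuer)
    if lookup is None:
        # A stranger: refuse before any endpoint is contacted.
        raise UntrustedIssuer(issuer)
    key = lookup(kid)
    if key is None:
        raise UnknownSigningKey(f"{issuer} publishes no key with kid {kid!r}")
    return key
```

The ordering is the point: the trust-set check happens before anything touches the network, so an unnamed issuer's endpoint is never even asked.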

Rotation's quiet twin: revocation

Rotation is a key showing up. There is a nastier version of the same problem, which is a key going away. An issuer's signing key gets compromised, and the issuer pulls it from its JWKS to kill it. Every token still floating around signed by that key should stop verifying, now.

Fetching by kid does not, on its own, get you this. A verifier that fetched a key once and cached it will keep honoring that kid forever, because a kid it has already seen never sends it back to look again. Rotation is handled by the refetch on the unseen kid. Revocation is about a kid you have very much seen and should stop trusting. The cache that made rotation cheap is exactly what keeps a dead key alive.

So a fetched key gets a shelf life. It is trusted for a bounded window, a TTL, and once that lapses the cache is stale and has to be reconfirmed against the live JWKS before it is served again. When the issuer has dropped the compromised key, the reconfirm comes back without it, and tokens signed by it start failing. Revocation lands within one TTL window rather than never.

The reconfirm has to fail closed, and this is the part worth slowing down on. If the JWKS fetch fails and the verifier shrugs and serves its stale cache anyway, then anyone who can stall that fetch just bought the revoked key an extension for as long as they can keep the endpoint unreachable. That is the whole revocation, handed back to the attacker through the availability door. So a stale cache that can't be reconfirmed refuses. The cost is that the issuer's JWKS being reachable is now part of your verification path, which is the honest price of checking against a live issuer instead of a photograph of one.
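A sketch of the shelf life and the fail-closed reconfirm, assuming names of my own choosing (TtlKeyCache, StaleKeysUnconfirmed, an injectable clock). It covers only the TTL behavior and leaves out the refetch on an unseen kid, to keep the one idea visible.

```python
import time
from typing import Callable, Dict, Optional


class StaleKeysUnconfirmed(Exception):
    """The cached keys outlived their TTL and the live JWKS could not be fetched."""


class TtlKeyCache:
    def __init__(
        self,
        fetch_jwks: Callable[[], dict],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch_jwks = fetch_jwks  # hypothetical: returns parsed JWKS, raises on failure
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the window can be exercised in a test
        self._keys: Dict[str, dict] = {}
        self._fetched_at: Optional[float] = None

    def _is_fresh(self) -> bool:
        return self._fetched_at is not None and self._clock() - self._fetched_at < self._ttl

    def get(self, kid: str) -> Optional[dict]:
        if not self._is_fresh():
            try:
                jwks = self._fetch_jwks()
            except Exception as exc:
                # Fail closed: a stale copy that can't be reconfirmed is never served.
                raise StaleKeysUnconfirmed("could not reconfirm keys against the issuer") from exc
            # Keys the issuer has dropped disappear here.
            self._keys = {k["kid"]: k for k in jwks.get("keys", []) if "kid" in k}
            self._fetched_at = self._clock()
        return self._keys.get(kid)
```

Raising a distinct exception, instead of returning None, keeps "the issuer was unreachable" separate from "the issuer no longer publishes this key," which are different problems for whoever is reading the logs.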

What I am not going to oversell

TLS is doing real load-bearing work now, and I should say that out loud instead of letting it hide. "The keys come from the issuer's own endpoint" is only as true as your certificate validation. Point this at an issuer over plain HTTP, or disable cert checks because a test was annoying, and the trust boundary I just drew has a hole straight through it. In the demo below the issuers run on localhost over plain HTTP, which is fine for showing the mechanism and would be a real hole in production. The honest claim is "keys fetched from the issuer over an authenticated channel," and the authenticated part is a requirement, not a decoration.
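One way to keep that requirement from eroding quietly is a guard in front of every key fetch. This is a sketch with a hypothetical name (require_authenticated_channel) and a hypothetical opt-in flag for loopback demos; it checks only the URL scheme, so it complements certificate validation rather than replacing it.

```python
from urllib.parse import urlsplit


def require_authenticated_channel(url: str, allow_insecure_loopback: bool = False) -> str:
    """Return url unchanged if keys may be fetched from it; raise ValueError otherwise."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return url
    is_loopback = parts.hostname in ("localhost", "127.0.0.1", "::1")
    if parts.scheme == "http" and is_loopback and allow_insecure_loopback:
        return url  # demo-only escape hatch, off unless explicitly requested
    raise ValueError(f"refusing to fetch keys over an unauthenticated channel: {url}")
```

The scheme check is the easy half. The other half is leaving the HTTP client's certificate verification at its default: Python's urllib.request verifies HTTPS certificates unless it is handed a context that says otherwise.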

The revocation window is a window, not an instant. A key stays honored for up to a TTL after the issuer kills it. Tighten the TTL and revocation lands faster, at the cost of more fetches; the genuinely instant version wants push invalidation, not polling, and I haven't built that. There's also no retry or backoff around a flapping issuer yet, just a plain timeout and a fail-closed refusal. A verifier that fetches is a verifier that can be made to wait, and a production one wants more grace around that than this has. Real, and not done.
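The TTL trade-off is plain arithmetic, and a few lines make it easy to eyeball. This is a back-of-envelope sketch with a hypothetical helper name; it counts only TTL-driven reconfirms for one issuer on one verifier, and ignores the extra fetches that unseen kids cause.

```python
def ttl_tradeoff(ttl_seconds: float) -> dict:
    """Rough cost/benefit of a given TTL for one issuer on one verifier."""
    seconds_per_day = 24 * 60 * 60
    return {
        # A key cached just before the issuer drops it is honored until the TTL lapses.
        "worst_case_revocation_delay_s": ttl_seconds,
        # At most one reconfirm per TTL window, and only if tokens keep arriving.
        "max_reconfirm_fetches_per_day": seconds_per_day / ttl_seconds,
    }


for ttl in (60, 300, 3600):
    print(ttl, ttl_tradeoff(ttl))
```

A five-minute TTL caps the revocation delay at 300 seconds for at most 288 reconfirms a day; an hour-long TTL drops that to 24 fetches and stretches the window to match.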

What is done is the thing that was actually broken: the trust set no longer freezes a key it has no business freezing. You name the issuers. Their keys stay theirs, fetched live, followed through rotation, and never once sourced from the party you're auditing.

Try it

git clone https://github.com/AlexlaGuardia/crumb
cd crumb
python -m crumb.jwks_federation_demo

The demo stands up two identity providers on real ports, each serving its own discovery and JWKS endpoints. A verifier that pinned nothing names the two issuers, fetches their keys live, and verifies a delegation chain back to the human. Then one issuer rotates its signing key, and the same verifier follows the rotation with a single refetch and keeps verifying. Next, an issuer revokes a key, and the demo shows a token signed by it still passing inside the TTL window, then getting refused once the cache reconfirms and the key is gone. Finally, a chain built on an issuer the verifier never named gets refused, because a live endpoint was never what earned trust.

The rest of Crumb, including the cross-issuer stapling this builds on, is at crumb.alexlaguardia.dev.

If you work on agent identity and you see a way the fetch path hands trust back to the wrong party, that's exactly the hole I want pointed out.