MCP Credential Lifecycle: What Happens When Your Tokens Expire in Production

#mcp #security #api #devops

Most MCP server operators discover their token management strategy the hard way: at 2am, when an agent's tool calls start failing with auth errors and the server logs show nothing useful about why.

Credential lifecycle isn't glamorous. It's not something you add to a README. But it's the operational difference between a server that's reliable and one that fails silently whenever a token quietly expires.

The Problem

When an MCP server connects to upstream APIs, it holds credentials. Those credentials expire. Sometimes predictably (OAuth tokens, 1h/24h/90d depending on provider). Sometimes not — API keys revoked by admin action, NPM publish tokens that expire because the registry rotates them, service account credentials rotated by a security policy.

An MCP server that doesn't handle expiry proactively is a server that waits for an agent to call a tool, get a failed response, and then... what? Retry? Fail the whole workflow? Surface an opaque error the orchestrator can't route?

GitHub issue #3061 in modelcontextprotocol/servers is a clean example: a server dependency was tied to an NPM token that expired, and the server stopped working silently. The fix was mechanical — rotate the token — but symptom discovery was the real problem: no alert, no structured error, no machine-detectable expiry signal.

Five Credential Lifecycle Failures

1. Silent expiry — no pre-flight check

Token expires. First indication: a tool call fails. The error is often 401 Unauthorized, but the MCP server may not propagate this clearly to the orchestrator. The agent sees a tool failure, not an auth failure.

Fix: Run pre-flight expiry checks before accepting traffic. If an upstream token expires in under 60 minutes, the server should signal degraded capacity before the first call fails, not after.
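
A minimal sketch of that pre-flight check, assuming each credential's expiry is known locally (for example, from the token response or the secrets store). The Credential type, the preflight_status function, and the status labels are hypothetical names for illustration; the 60-minute threshold comes from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Optional


@dataclass
class Credential:
    name: str                        # hypothetical label, e.g. "payments-api"
    expires_at: Optional[datetime]   # None means no known expiry


def preflight_status(
    creds: Iterable[Credential],
    now: datetime,
    min_remaining: timedelta = timedelta(minutes=60),
) -> Dict[str, str]:
    """Classify each credential as 'ok', 'degraded', or 'expired'."""
    report = {}
    for cred in creds:
        if cred.expires_at is None:
            report[cred.name] = "ok"
        elif cred.expires_at <= now:
            report[cred.name] = "expired"
        elif cred.expires_at - now < min_remaining:
            # Still valid, but inside the warning window.
            report[cred.name] = "degraded"
        else:
            report[cred.name] = "ok"
    return report
```

A server could run this at startup and on a timer, refusing traffic on "expired" and advertising reduced capacity on "degraded". How that signal reaches the orchestrator is left open here.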

2. No expiry metadata on upstream errors

Even when a token expires and the upstream returns WWW-Authenticate headers or a machine-readable error body, the MCP server often swallows this and returns a generic tool error. The orchestrator has no way to distinguish "this API is down" from "our credential expired."

Fix: Surface auth and quota failure signals as structured tool errors with a type field (credential_expired, credential_revoked, rate_limit_reached) so the orchestrator can route intelligently.
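
One way such a structured error could look, as a sketch. The outer isError/content envelope follows the MCP tool result shape; the JSON fields inside the text item (type, upstream, detail) and the helper name structured_tool_error are assumptions for illustration, not part of any specification.

```python
import json

# Error types taken from the text above.
ALLOWED_TYPES = {"credential_expired", "credential_revoked", "rate_limit_reached"}


def structured_tool_error(error_type: str, upstream: str, detail: str) -> dict:
    """Build a tool result whose text payload is machine-readable JSON."""
    if error_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown error type: {error_type!r}")
    payload = {"type": error_type, "upstream": upstream, "detail": detail}
    return {
        "isError": True,
        "content": [{"type": "text", "text": json.dumps(payload)}],
    }
```

An orchestrator that parses the text payload can then branch on type instead of pattern-matching a free-form message.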

3. Single-point credential storage

If the MCP server holds a single long-lived API key, that key is the blast radius. If it leaks, if it expires, if the provider revokes it: total tool unavailability.

Fix: Scope credentials per use-case. If the server needs read access and write access, use separate keys. Rotation of one doesn't take down the other.
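
A small sketch of per-scope credentials, assuming two keys and a hypothetical tool-to-scope mapping (order-refund is an invented tool name, and the key strings are placeholders). Disabling one scope leaves tools on the other scope working.

```python
# Hypothetical mapping from tool name to the scope it needs.
TOOL_SCOPES = {"order-lookup": "read", "order-refund": "write"}


class ScopedCredentials:
    """Holds one key per scope so a single failure has a limited blast radius."""

    def __init__(self, keys_by_scope: dict):
        self._keys = dict(keys_by_scope)

    def key_for(self, tool: str) -> str:
        scope = TOOL_SCOPES[tool]
        key = self._keys.get(scope)
        if key is None:
            raise LookupError(f"no usable credential for scope {scope!r}")
        return key

    def disable(self, scope: str) -> None:
        # Called when a key for this scope expires, leaks, or is revoked.
        self._keys.pop(scope, None)
```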

4. No rotation path

An API key that was created manually and lives in an env file has no automated rotation path. When the provider deprecates it — or your security team asks for proof of rotation — the answer is "we'll update the env file manually." That's not a lifecycle. It's a credential graveyard.

Fix: Design credentials as things that rotate. Prefer provider-supported rotation APIs (Stripe has this; most enterprise APIs do). Fail loudly when a credential is approaching a forced rotation deadline.
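
"Fail loudly" near a rotation deadline can be as simple as the sketch below. It assumes the key's creation time and the policy's maximum age are known; RotationOverdue, check_rotation_deadline, and the 7-day warning window are hypothetical choices.

```python
from datetime import datetime, timedelta, timezone


class RotationOverdue(RuntimeError):
    """Raised when a credential is past its forced rotation deadline."""


def check_rotation_deadline(
    created_at: datetime,
    max_age: timedelta,
    now: datetime,
    warn_window: timedelta = timedelta(days=7),
) -> str:
    """Return 'ok' or 'warn'; raise once the deadline has passed."""
    deadline = created_at + max_age
    if now >= deadline:
        raise RotationOverdue(f"rotation deadline passed at {deadline.isoformat()}")
    return "warn" if deadline - now <= warn_window else "ok"
```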

5. Missing revocation response

Upstream providers increasingly revoke credentials proactively: security incidents, key age policies, billing events. At the status-code level, a 401 from a revoked key looks identical to a 401 from an expired token, but the correct action is different. Expired token → refresh or rotate. Revoked key → manual intervention required.

Fix: Distinguish revocation from expiry in your error handling. Provider error codes often distinguish these (invalid_api_key vs token_expired). Surface the distinction to the orchestrator.
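
A sketch of that distinction, assuming the upstream returns an HTTP status plus a provider error code. The two codes are the examples named above; treating invalid_api_key as a revocation is an assumption that needs checking per provider, and the function name, the fallback types, and the action labels are hypothetical.

```python
from typing import Optional

# Assumed per-provider mappings; real providers use different codes.
EXPIRED_CODES = {"token_expired"}
REVOKED_CODES = {"invalid_api_key"}

# Suggested follow-up per error type (labels are illustrative).
NEXT_ACTION = {
    "credential_expired": "refresh_or_rotate",
    "credential_revoked": "page_oncall",
    "credential_unknown": "page_oncall",
}


def classify_auth_failure(status: int, error_code: Optional[str]) -> str:
    """Map an upstream failure to a structured error type."""
    if status == 429:
        return "rate_limit_reached"
    if status != 401:
        return "upstream_error"
    if error_code in EXPIRED_CODES:
        return "credential_expired"
    if error_code in REVOKED_CODES:
        return "credential_revoked"
    # A 401 with no recognizable code: do not guess that a refresh will help.
    return "credential_unknown"
```

Sending an unrecognized 401 to a human rather than to the refresh path is a conservative default: retrying a refresh against a revoked key only delays the alert.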

What Good Looks Like

AN Score tracks access readiness as a first-class evaluation dimension. Here's what production-grade credential lifecycle looks like across the operational timeline:

Before first call:

Load credentials from a managed secrets store (not an env file)
Validate token expiry at server startup; fail loudly if any token expires in under N minutes
For OAuth flows: verify the refresh token is valid and can acquire a new access token without human interaction

During operation:

Track auth errors by type (expired / revoked / rate-limited / scope-insufficient)
Surface structured errors to orchestrators so agents can make routing decisions
For long-running servers: proactively refresh tokens before they expire, not after
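
The proactive-refresh item can be sketched as a token wrapper that renews whenever the remaining lifetime drops below a safety margin. This version refreshes on access rather than on a background timer and is not thread-safe; RefreshingToken, the fetch callable (assumed to return a token and its expiry as epoch seconds), and the 300-second margin are all hypothetical.

```python
import time
from typing import Callable, Optional, Tuple


class RefreshingToken:
    """Returns a token, renewing it before it expires rather than after."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],  # returns (token, expires_at)
        skew_s: float = 300.0,
        clock: Callable[[], float] = time.time,
    ):
        self._fetch = fetch
        self._skew_s = skew_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Refresh if we have no token yet or are inside the safety margin.
        if self._token is None or self._clock() >= self._expires_at - self._skew_s:
            self._token, self._expires_at = self._fetch()
        return self._token
```

Injecting the clock keeps the refresh logic testable without waiting for real tokens to age.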

Rotation events:

Handle token rotation without server restart; reload credentials from the secrets store on rotation signal
Notify orchestrators of a brief degraded window if needed
Log rotation events with enough metadata to audit (timestamp, trigger, scope, upstream)
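
A sketch of restart-free rotation, assuming a load callable that reads the current secret from the store. CredentialHolder and its method names are hypothetical, and the in-memory rotation_log stands in for a real audit sink; the logged fields are the ones listed above.

```python
import threading
from datetime import datetime, timezone
from typing import Callable


class CredentialHolder:
    """Keeps the live credential and swaps it in place on a rotation signal."""

    def __init__(self, load: Callable[[], str], upstream: str):
        self._load = load
        self._upstream = upstream
        self._lock = threading.Lock()
        self._value = load()
        self.rotation_log = []

    def current(self) -> str:
        with self._lock:
            return self._value

    def reload(self, trigger: str, scope: str) -> None:
        # Load outside the lock so a slow secrets store does not block readers.
        new_value = self._load()
        with self._lock:
            self._value = new_value
        self.rotation_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "trigger": trigger,
            "scope": scope,
            "upstream": self._upstream,
        })
```

The rotation signal itself could be a webhook from the secrets store, a file watch, or a process signal handler that calls reload; which one fits depends on the deployment.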

Revocation events:

Distinguish revocation from expiry
Page on-call or send a structured alert immediately — revocations are not self-healing
Preserve tool call logs up to the revocation event for incident review

Provider Context: Why This Varies

AN Score data shows meaningful variation in how much work providers do for you:

Stripe (8.1 L4): Restricted API keys with granular permissions, visible in dashboard, revocable by scope. Expiry is operator-controlled (no forced rotation). Clear, structured 401 error bodies.

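GitHub (7.6 L3): Fine-grained PATs with explicit expiry dates and granular scope control; classic tokens report their scopes in the machine-readable X-OAuth-Scopes header. Token expiry is surfaced in API responses before it happens, via the GitHub-Authentication-Token-Expiration response header.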

HubSpot (4.6 L2): OAuth access tokens expire after 30 minutes; refresh is available but the flow is non-trivial. Private app tokens are long-lived but lack granular scope control. Auth errors are not always structured.

Enterprise SaaS generally: Access tokens are short-lived, credential rotation is admin-gated, and the path from expired token to fresh credential often requires a human. Design your MCP server to surface this clearly rather than waiting for a tool call to fail.

The Observability Connection

Last week's post on MCP observability covers the logging side of this. Credential lifecycle events should appear in your audit log:

Token acquired: source=secrets-manager, scope=read:orders, expires_at=2026-04-04T08:00:00Z
Token refreshed: trigger=pre-flight, expires_in=3600
Token expiry warning: expires_in=300, tool=order-lookup
Token expired: action=tool-calls-paused, upstream=payments-api
Token revocation detected: action=alert-sent, tool=order-lookup

Without this trace, an operator reviewing a production incident has no idea when credential state changed relative to tool call failures.
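
A small helper that could emit lines in the event-plus-key=value shape shown above, using the standard logging module. The logger name mcp.audit and both function names are hypothetical.

```python
import logging

audit_log = logging.getLogger("mcp.audit")  # hypothetical logger name


def format_credential_event(event: str, **fields) -> str:
    """Render 'Event: key=value, key=value' in the order fields were passed."""
    parts = ", ".join(f"{key}={value}" for key, value in fields.items())
    return f"{event}: {parts}" if parts else event


def log_credential_event(event: str, **fields) -> str:
    """Write the event to the audit logger and return the rendered line."""
    line = format_credential_event(event, **fields)
    audit_log.info(line)
    return line
```

Calling log_credential_event("Token refreshed", trigger="pre-flight", expires_in=3600) produces the second line of the example trace. Timestamps are left to the logging formatter.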

Checklist: Credential Lifecycle for MCP Servers

[ ] All credentials loaded from a managed secrets store at runtime (not baked into env or config)
[ ] Pre-flight expiry check runs at server startup; surfaces a clear error if any credential expires in under 60 minutes
[ ] Credential type, scope, and expiry metadata logged at acquisition time
[ ] Auth failures surface structured errors (credential_expired, credential_revoked) — not generic tool failures
[ ] OAuth refresh path tested without human interaction in staging
[ ] Token rotation does not require server restart
[ ] Revocation is distinguished from expiry in error handling
[ ] Revocation triggers an alert (not just a log line)
[ ] All credential lifecycle events appear in the audit log

Closing

Credential lifecycle is one of those production details that's boring until it isn't. The NPM token expiry issue is a simple example — a dependency credential expired, the server stopped working. But the general pattern applies to every MCP server that treats auth as a one-time setup task instead of an ongoing operational surface.

If you're evaluating MCP servers for production use, the credential lifecycle question is simple: what happens when a token expires at 2am? If the answer is "the agent finds out when a tool call fails," that's a gap worth addressing before the first production incident.

This is part of an ongoing series on production-grade MCP operator patterns. Previous posts: Production Readiness Checklist · Prompt Injection and Scope Constraints · Multi-Tenant MCP Design · MCP Observability

Comparing which upstream APIs have the best credential lifecycle support out of the box? The AN Score access readiness dimension covers this across 600+ services.