The token is valid, but your headless Claude Code agent just 401'd forever

Eric Lytle — Sun, 28 Jun 2026 09:03:09 +0000

TL;DR: A static OAuth access token can return HTTP 200 on a raw /v1/messages call at the exact instant a long-running Claude Code instance using that same token gets 401 "Invalid authentication credentials," because the rejection is bound to the instance's own server-side session identity, not the token. Worse, once it 401s the instance hard-latches and never self-recovers until you restart the process, so any "is the token valid?" probe is structurally blind to the problem.

The setup

We run several headless Claude Code instances on Linux, long-running and unattended (systemd services in our case). Authentication is a single static CLAUDE_CODE_OAUTH_TOKEN environment variable: an sk-ant-oat01… OAuth access token from a Claude Max subscription, minted with claude setup-token. It has no refresh token, and the instances never touch ~/.claude/.credentials.json (the rotating credential file). Auth is purely the static env token. We're on Claude Code v2.1.195, the latest stable as of this writing.

Recurrently, an instance's model API calls start returning HTTP 401 ("Invalid authentication credentials" / the CLI shows "Please run /login"). Across our fleet over 2026-06-13..06-28 we logged 212 distinct 401 windows / 245 request_ids, roughly 8 per day fleet-wide. Windows last from seconds to ~125 minutes, rarely up to ~7 hours.

The obvious diagnosis is "the token expired / got revoked." We chased that and found it's wrong. Here's what's actually happening, finding by finding.

Finding 1: It's session-bound, not credential-bound

This is the non-obvious one, so lead with it.

During a live wedge, with an instance actively returning 401 on its own turns, we fired raw POST https://api.anthropic.com/v1/messages using the same static oat01 token the wedged instance uses. We tried it in many shapes: minimal; agent-shaped; large cache-creation; streaming; 12 tools; with metadata; resumed-style. Every one returned HTTP 200 at the same instant the wedged instance's own turns returned 401.

The token is valid. The account is fine. The request shape, size, model, and source IP are all fine. The raw probe shares all of them and succeeds. The only thing the probe does not share is the wedged instance's own long-lived server-side session/process identity.

Conclusion: the rejection is bound to the instance's own server-side session identity, not the token, not the request, not the account.

Finding 2: A hard client-side latch on a still-valid token

Across 412 sessions / 153 distinct 401 events, the number that self-recovered without a process restart was zero. Even after the upstream rejection window closes, even after a raw probe on that token is happily returning 200, the instance stays latched until you restart it.

Note what this rules out. We're on v2.1.195, which already ships Anthropic's v2.1.117 "reactive token refresh on 401" fix and the v2.1.178 "stale cached request configuration" fix. It still latches. That's consistent with Finding 1: re-minting or refreshing the token cannot help when the rejection is bound to session identity rather than to the token.

Finding 3: Token probes are structurally blind

This follows directly. Any external "is the token valid?" probe shares the token but not the wedged session identity, so it returns 200 throughout the entire outage. "Token is valid" tells you nothing about whether the instance is latched.

This is the single most important operational lesson here: never gate recovery on a token probe. A green probe and a dead agent coexist happily. We verify recovery only by observing an actual non-401 turn from the instance itself.

Finding 4: A separate big-model-tier 429 masquerade

Flag this as distinct from the 401 latch. It's a different failure that's easy to conflate.

In one ~7-hour outage, direct probes showed Opus 4.8 and Sonnet 4.6 returning HTTP 429 rate_limit_error (a generic "Error" body, x-should-retry: true, no retry-after header) while Haiku returned HTTP 200 on the same token. This was not a usage cap: the 5-hour cap was ~10% used and had reset ~3 hours earlier, and the condition persisted through more than 5 hours of idle.

The trap: a naive probe that hits Haiku reads 200 and reports "token fine," completely missing a big-model-tier throttle. If you're going to probe at all (and per Finding 3, be skeptical), probe the tier you actually run on.

Idle-wake skew (unproven)

One more pattern, marked unproven because the mechanism isn't established. Rebuilding 54 genuine 401 episodes from session logs, idle-wake episodes (>1h idle) were 71% morning vs. mid-use episodes (≤1h idle) at 0% morning. That's suggestive that the server-side session identity may go stale after a long idle period. It's real but a minority of episodes, and we have not proven the mechanism, so treat it as a lead, not a conclusion.

Independent corroboration

This isn't just our fleet. GitHub anthropics/claude-code #61912 captured the same token returning 200 on /oauth/hello and 401 on /v1/messages in the same second, token unexpired: the same session-bound, probe-blind phenomenon. (That report attributes it to credential-file corruption, which can't apply here: our token is static with no refresh and the instances never read the credential file.)

What we do about it: verify by outcome, back off on a quiet window

Our mitigation is a watchdog with two design choices worth stealing:

Detect the 401 in the instance's own logs, then restart the wedged instance. A restart is the only thing that clears the latch (Finding 2).
Verify recovery by an observed non-401 turn, never by a token probe. Per Finding 3, the probe is blind; the only trustworthy signal that an instance is healthy is the instance itself producing a successful turn. For a session-bound failure, "is the credential valid?" is simply the wrong question. Validity and health are decoupled.

The third design choice that matters: a quiet-window backoff. The upstream rejection window can stay open for many minutes. If the watchdog restarts on a fixed short interval, it just restart-storms into a still-open window and churns. So it backs off, giving the upstream window time to close before the next recovery attempt, and confirms by outcome rather than by a clock.

What we think Anthropic should change

We're characterizing the failure precisely, not claiming we know its upstream root cause. Two asks:

The client latch shouldn't outlive the upstream window. On v2.1.195 it does: once an instance 401s, it stays dead until restarted even after a raw probe on the same token returns 200. A session-identity 401 needs the client to re-establish session state, not merely refresh the token.
"Token valid but the session 401s" needs a real fix, or at least an actionable error. Today the CLI surfaces a dead-end "Please run /login," which is a dead end when the token is demonstrably valid.

A few request_ids for tracing (all HTTP 401 authentication_failed, token valid throughout):

req_011CcVDWWs8GPfDyX8R9LEfW (2026-06-28 01:52 CDT)
req_011CcVDW3MetrtoQqLU2m8cn (2026-06-28 01:52 CDT)
req_011CcUaNDekFKPWogaeZ9adT (2026-06-27 17:45 CDT)
req_011Cc3X8oSfApMWRCs66taQw (2026-06-14 12:10 CDT)

If you run headless Claude Code agents and have seen the silent-death-after-401 pattern, the takeaways are: restart clears it, the token was never the problem, and a token probe will read green the entire time. Build your watchdog to verify by outcome, and back off so you don't restart into an open window.

DEV Community: Eric Lytle