<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eric Lytle</title>
    <description>The latest articles on DEV Community by Eric Lytle (@drickon).</description>
    <link>https://dev.to/drickon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4006265%2F328a9f1f-81ec-47c1-9132-cca955411f1f.jpg</url>
      <title>DEV Community: Eric Lytle</title>
      <link>https://dev.to/drickon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/drickon"/>
    <language>en</language>
    <item>
      <title>The token is valid — but your headless Claude Code agent just 401'd forever</title>
      <dc:creator>Eric Lytle</dc:creator>
      <pubDate>Sun, 28 Jun 2026 09:03:09 +0000</pubDate>
      <link>https://dev.to/drickon/the-token-is-valid-but-your-headless-claude-code-agent-just-401d-forever-48ip</link>
      <guid>https://dev.to/drickon/the-token-is-valid-but-your-headless-claude-code-agent-just-401d-forever-48ip</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; A static OAuth access token can return HTTP 200 on a raw &lt;code&gt;/v1/messages&lt;/code&gt; call at the exact instant a long-running Claude Code instance using that &lt;em&gt;same token&lt;/em&gt; gets 401 "Invalid authentication credentials" — because the rejection is bound to the instance's own server-side session identity, not the token. Worse, once it 401s the instance hard-latches and never self-recovers until you restart the process, so any "is the token valid?" probe is structurally blind to the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;We run several headless Claude Code instances on Linux — long-running, unattended (systemd services in our case). Authentication is a single static &lt;code&gt;CLAUDE_CODE_OAUTH_TOKEN&lt;/code&gt; environment variable: an &lt;code&gt;sk-ant-oat01…&lt;/code&gt; OAuth access token from a Claude Max subscription, minted with &lt;code&gt;claude setup-token&lt;/code&gt;. It has &lt;strong&gt;no refresh token&lt;/strong&gt;, and the instances never touch &lt;code&gt;~/.claude/.credentials.json&lt;/code&gt; (the rotating credential file). Auth is purely the static env token. We're on Claude Code v2.1.195, the latest stable as of this writing.&lt;/p&gt;

&lt;p&gt;Recurrently, an instance's model API calls start returning HTTP 401 ("Invalid authentication credentials" / the CLI shows "Please run /login"). Across our fleet over 2026-06-13..06-28 we logged &lt;strong&gt;212 distinct 401 windows / 245 request_ids&lt;/strong&gt; — roughly 8 per day fleet-wide. Windows last from seconds to ~125 minutes, rarely up to ~7 hours.&lt;/p&gt;

&lt;p&gt;The obvious diagnosis is "the token expired / got revoked." We chased that and found it's wrong. Here's what's actually happening, finding by finding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 1 — It's session-bound, not credential-bound
&lt;/h2&gt;

&lt;p&gt;This is the non-obvious one, so lead with it.&lt;/p&gt;

&lt;p&gt;During a &lt;em&gt;live&lt;/em&gt; wedge — an instance actively returning 401 on its own turns — we fired raw &lt;code&gt;POST https://api.anthropic.com/v1/messages&lt;/code&gt; using the &lt;strong&gt;same static &lt;code&gt;oat01&lt;/code&gt; token the wedged instance uses&lt;/strong&gt;. We tried it in many shapes: minimal; agent-shaped; large cache-creation; streaming; 12 tools; with metadata; resumed-style. &lt;strong&gt;Every one returned HTTP 200 at the same instant the wedged instance's own turns returned 401.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The token is valid. The account is fine. The request shape, size, model, and source IP are all fine — the raw probe shares all of them and succeeds. The only thing the probe does &lt;em&gt;not&lt;/em&gt; share is the wedged instance's own long-lived server-side session/process identity.&lt;/p&gt;

&lt;p&gt;Conclusion: &lt;strong&gt;the rejection is bound to the instance's own server-side session identity&lt;/strong&gt; — not the token, not the request, not the account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 2 — A hard client-side latch on a still-valid token
&lt;/h2&gt;

&lt;p&gt;Across &lt;strong&gt;412 sessions / 153 distinct 401 events&lt;/strong&gt;, the number that self-recovered without a process restart was &lt;strong&gt;zero&lt;/strong&gt;. Even after the upstream rejection window closes — even after a raw probe on that token is happily returning 200 — the instance stays latched until you restart it.&lt;/p&gt;

&lt;p&gt;Note what this rules out. We're on v2.1.195, which already ships Anthropic's v2.1.117 "reactive token refresh on 401" fix and the v2.1.178 "stale cached request configuration" fix. It still latches. That's consistent with Finding 1: re-minting or refreshing the token cannot help when the rejection is bound to session identity rather than to the token.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 3 — Token probes are structurally blind
&lt;/h2&gt;

&lt;p&gt;This follows directly. Any external "is the token valid?" probe shares the token but &lt;strong&gt;not&lt;/strong&gt; the wedged session identity, so it returns 200 throughout the entire outage. "Token is valid" tells you nothing about whether the instance is latched.&lt;/p&gt;

&lt;p&gt;This is the single most important operational lesson here: &lt;strong&gt;never gate recovery on a token probe.&lt;/strong&gt; A green probe and a dead agent coexist happily. We verify recovery only by observing an actual non-401 turn from the instance itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 4 — A separate big-model-tier 429 masquerade
&lt;/h2&gt;

&lt;p&gt;Flag this as &lt;em&gt;distinct&lt;/em&gt; from the 401 latch — it's a different failure that's easy to conflate.&lt;/p&gt;

&lt;p&gt;In one ~7-hour outage, direct probes showed &lt;strong&gt;Opus 4.8 and Sonnet 4.6 returning HTTP 429 &lt;code&gt;rate_limit_error&lt;/code&gt;&lt;/strong&gt; (a generic "Error" body, &lt;code&gt;x-should-retry: true&lt;/code&gt;, no &lt;code&gt;retry-after&lt;/code&gt; header) while &lt;strong&gt;Haiku returned HTTP 200 on the same token&lt;/strong&gt;. This was not a usage cap: the 5-hour cap was ~10% used and had reset ~3 hours earlier, and the condition persisted through more than 5 hours of idle.&lt;/p&gt;

&lt;p&gt;The trap: a naive probe that hits Haiku reads 200 and reports "token fine," completely missing a big-model-tier throttle. If you're going to probe at all (and per Finding 3, be skeptical), probe the tier you actually run on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Idle-wake skew (unproven)
&lt;/h2&gt;

&lt;p&gt;One more pattern, marked unproven because the mechanism isn't established. Rebuilding 54 genuine 401 episodes from session logs, &lt;strong&gt;idle-wake episodes (&amp;gt;1h idle) were 71% morning vs. mid-use episodes (≤1h idle) at 0% morning&lt;/strong&gt;. That's suggestive that the server-side session identity may go stale after a long idle period. It's real but a minority of episodes, and we have not proven the mechanism — treat it as a lead, not a conclusion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Independent corroboration
&lt;/h2&gt;

&lt;p&gt;This isn't just our fleet. GitHub &lt;code&gt;anthropics/claude-code&lt;/code&gt; #61912 captured the &lt;strong&gt;same token returning 200 on &lt;code&gt;/oauth/hello&lt;/code&gt; and 401 on &lt;code&gt;/v1/messages&lt;/code&gt; in the same second&lt;/strong&gt;, token unexpired — the same session-bound, probe-blind phenomenon. (That report attributes it to credential-file corruption, which can't apply here: our token is static with no refresh and the instances never read the credential file.)&lt;/p&gt;

&lt;h2&gt;
  
  
  What we do about it: verify by outcome, back off on a quiet window
&lt;/h2&gt;

&lt;p&gt;Our mitigation is a watchdog with two design choices worth stealing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Detect the 401 in the instance's own logs&lt;/strong&gt;, then restart the wedged instance. A restart is the only thing that clears the latch (Finding 2).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify recovery by an &lt;em&gt;observed non-401 turn&lt;/em&gt; — never by a token probe.&lt;/strong&gt; Per Finding 3, the probe is blind; the only trustworthy signal that an instance is healthy is the instance itself producing a successful turn. For a session-bound failure, "is the credential valid?" is simply the wrong question — validity and health are decoupled.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The third design choice that matters: a &lt;strong&gt;quiet-window backoff&lt;/strong&gt;. The upstream rejection window can stay open for many minutes. If the watchdog restarts on a fixed short interval, it just restart-storms &lt;em&gt;into&lt;/em&gt; a still-open window and churns. So it backs off, giving the upstream window time to close before the next recovery attempt, and confirms by outcome rather than by a clock.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we think Anthropic should change
&lt;/h2&gt;

&lt;p&gt;We're characterizing the failure precisely, not claiming we know its upstream root cause. Two asks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The client latch shouldn't outlive the upstream window.&lt;/strong&gt; On v2.1.195 it does — once an instance 401s, it stays dead until restarted even after a raw probe on the same token returns 200. A session-identity 401 needs the client to re-establish session state, not merely refresh the token.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Token valid but the session 401s" needs a real fix, or at least an actionable error.&lt;/strong&gt; Today the CLI surfaces a dead-end "Please run /login," which is a dead end when the token is demonstrably valid.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A few request_ids for tracing (all HTTP 401 &lt;code&gt;authentication_failed&lt;/code&gt;, token valid throughout):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;req_011CcVDWWs8GPfDyX8R9LEfW&lt;/code&gt; (2026-06-28 01:52 CDT)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;req_011CcVDW3MetrtoQqLU2m8cn&lt;/code&gt; (2026-06-28 01:52 CDT)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;req_011CcUaNDekFKPWogaeZ9adT&lt;/code&gt; (2026-06-27 17:45 CDT)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;req_011Cc3X8oSfApMWRCs66taQw&lt;/code&gt; (2026-06-14 12:10 CDT)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you run headless Claude Code agents and have seen the silent-death-after-401 pattern, the takeaways are: restart clears it, the token was never the problem, and a token probe will read green the entire time. Build your watchdog to verify by outcome, and back off so you don't restart into an open window.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>llmops</category>
      <category>devops</category>
      <category>debugging</category>
    </item>
  </channel>
</rss>
