<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Takayuki Kawazoe</title>
    <description>The latest articles on DEV Community by Takayuki Kawazoe (@zoetaka38).</description>
    <link>https://dev.to/zoetaka38</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3902826%2F0187a85d-f9a1-45bb-871d-bf5e49ddcccc.jpeg</url>
      <title>DEV Community: Takayuki Kawazoe</title>
      <link>https://dev.to/zoetaka38</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zoetaka38"/>
    <language>en</language>
    <item>
      <title>"How a headless CLI logs in: implementing OAuth Device Code Flow for an MCP client"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Tue, 09 Jun 2026 07:42:51 +0000</pubDate>
      <link>https://dev.to/zoetaka38/how-a-headless-cli-logs-in-implementing-oauth-device-code-flow-for-an-mcp-client-5a1l</link>
      <guid>https://dev.to/zoetaka38/how-a-headless-cli-logs-in-implementing-oauth-device-code-flow-for-an-mcp-client-5a1l</guid>
      <description>&lt;p&gt;When you connect an MCP server to your own service, one unglamorous problem shows up fast: how does the CLI log in?&lt;/p&gt;

&lt;p&gt;A web app with a browser can use the OAuth authorization code flow — redirect the user to a login page, exchange the returned code for a token. But MCP clients often run where there's no GUI browser: over SSH, in a CI container, on a headless box. The loopback trick (&lt;code&gt;http://localhost:random_port&lt;/code&gt; as the redirect target) doesn't help either, because there's no browser to open.&lt;/p&gt;

&lt;p&gt;OAuth has a proper answer for "authenticate a user where there's no browser": RFC 8628, the Device Authorization Grant, a.k.a. Device Code Flow. I implemented it in Codens' Auth service, so here's the design and the real code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea: separate where you authenticate from where you approve
&lt;/h2&gt;

&lt;p&gt;Device Code Flow splits the "device that shows a code" (the CLI) from the "device that approves" (your everyday browser). It's the same thing as logging into Netflix on a TV: a code appears on screen, you type it on your phone.&lt;/p&gt;

&lt;p&gt;The flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The CLI calls &lt;code&gt;/oauth/device/authorize&lt;/code&gt; and gets back a &lt;code&gt;device_code&lt;/code&gt; (the machine's secret) and a &lt;code&gt;user_code&lt;/code&gt; (a short code a human types).&lt;/li&gt;
&lt;li&gt;The CLI shows the user "open this URL and enter ABCD-EFGH", then starts polling &lt;code&gt;/oauth/device/token&lt;/code&gt; in the background.&lt;/li&gt;
&lt;li&gt;The user opens the verification page in their normal browser, already logged in, enters the &lt;code&gt;user_code&lt;/code&gt;, and approves.&lt;/li&gt;
&lt;li&gt;The moment it's approved, the CLI's poll receives the token.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The CLI never opens a browser. The user approves from whatever browser they already have — phone, another laptop, anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Endpoint 1: device/authorize
&lt;/h2&gt;

&lt;p&gt;The CLI calls this first. It takes a &lt;code&gt;client_id&lt;/code&gt; and &lt;code&gt;scope&lt;/code&gt; and issues the two codes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/device/authorize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DeviceAuthorizationResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;device_authorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Form&lt;/span&gt;&lt;span class="p"&gt;(...),&lt;/span&gt;
    &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Form&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openid profile email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AsyncSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_session&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Is this client allowed to use the device_code grant?
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nc"&gt;OAuthClientRepository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_by_client_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_active&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invalid_client&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown client&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;allowed_grants&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grant_types&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_DEVICE_GRANT_TYPE&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;allowed_grants&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unauthorized_client&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Client not authorized for device_code grant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_device_code_store&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;frontend_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FRONTEND_URL&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;DeviceAuthorizationResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;device_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;verification_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;frontend_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/device&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;expires_in&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expires_in&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# 900s
&lt;/span&gt;        &lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;       &lt;span class="c1"&gt;# 5s poll interval
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;_DEVICE_GRANT_TYPE&lt;/code&gt; is the RFC's canonical string &lt;code&gt;urn:ietf:params:oauth:grant-type:device_code&lt;/code&gt;. If the client's &lt;code&gt;grant_types&lt;/code&gt; doesn't include it, reject. Not everyone gets device flow — only clients that explicitly opt in.&lt;/p&gt;

&lt;p&gt;Returning &lt;code&gt;interval&lt;/code&gt; (5s) and &lt;code&gt;expires_in&lt;/code&gt; (900s) matters: per RFC, the &lt;em&gt;server&lt;/em&gt; dictates the poll interval and expiry and tells the client. Don't let the client hardcode them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making the two codes
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;device_code&lt;/code&gt; and &lt;code&gt;user_code&lt;/code&gt; play different roles, so build them differently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# device_code: a secret the machine holds. Just needs to be unguessable.
&lt;/span&gt;&lt;span class="n"&gt;device_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;token_urlsafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# user_code: typed by a human. Readability comes first.
&lt;/span&gt;&lt;span class="n"&gt;_USER_CODE_CHARS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ABCDEFGHJKMNPQRSTUVWXYZ23456789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# drop confusable 0/O/1/I/L
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_generate_user_code&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_USER_CODE_CHARS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_USER_CODE_CHARS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# ABCD-EFGH
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;device_code&lt;/code&gt; gets &lt;code&gt;token_urlsafe(32)&lt;/code&gt; — if it leaks, someone can grab the token, so entropy wins here.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;user_code&lt;/code&gt; is typed by hand, so drop the confusable characters (&lt;code&gt;0&lt;/code&gt;/&lt;code&gt;O&lt;/code&gt;, &lt;code&gt;1&lt;/code&gt;/&lt;code&gt;I&lt;/code&gt;/&lt;code&gt;L&lt;/code&gt;) from the alphabet. The &lt;code&gt;ABCD-EFGH&lt;/code&gt; hyphenated shape makes typos easier to spot. It's a small security-for-UX trade, and it's fine: the &lt;code&gt;user_code&lt;/code&gt; is only used by an already-logged-in user to approve — it's not the token.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage: two Redis keys
&lt;/h2&gt;

&lt;p&gt;State lives in Redis: a primary key from &lt;code&gt;device_code&lt;/code&gt; to the state, and an index from &lt;code&gt;user_code&lt;/code&gt; to &lt;code&gt;device_code&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_CODE_PREFIX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device:code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;            &lt;span class="c1"&gt;# device:code:{device_code} -&amp;gt; state JSON (primary)
&lt;/span&gt;&lt;span class="n"&gt;_USER_CODE_PREFIX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device:user_code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# device:user_code:{user_code} -&amp;gt; device_code (index)
&lt;/span&gt;&lt;span class="n"&gt;_DEVICE_CODE_TTL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;900&lt;/span&gt;                    &lt;span class="c1"&gt;# 15 min, matches the MCP client timeout
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;device_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;token_urlsafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;user_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_generate_user_code&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;device_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expires_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;_DEVICE_CODE_TTL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Both keys, same TTL, one round trip via pipeline.
&lt;/span&gt;    &lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;_CODE_PREFIX&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;device_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_DEVICE_CODE_TTL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;_USER_CODE_PREFIX&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_DEVICE_CODE_TTL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why two keys? Polling arrives by &lt;code&gt;device_code&lt;/code&gt; (that's what the CLI holds). Approval arrives by &lt;code&gt;user_code&lt;/code&gt; (that's what the user types). You need to look up from both directions, so you keep a separate index. Put the same TTL on both and they expire together after 15 minutes — no cleanup job to write. That's the Redis TTL paying off.&lt;/p&gt;

&lt;p&gt;Normalize when looking up by &lt;code&gt;user_code&lt;/code&gt;, because it's human input — it'll arrive lowercase or without the hyphen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_by_user_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;user_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# ABCDEFGH -&amp;gt; ABCD-EFGH
&lt;/span&gt;    &lt;span class="n"&gt;device_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;_USER_CODE_PREFIX&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;abcdefgh&lt;/code&gt; and &lt;code&gt;ABCD-EFGH&lt;/code&gt; both work. Being strict here causes "it's correct but rejected" UX bugs, so be lenient on input.&lt;/p&gt;

&lt;h2&gt;
  
  
  Endpoint 2: device/token (the poll target)
&lt;/h2&gt;

&lt;p&gt;The CLI hits this every few seconds. It returns different answers for "not yet", "denied", "expired", and "here you go".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/device/token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;device_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;grant_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Form&lt;/span&gt;&lt;span class="p"&gt;(...),&lt;/span&gt;
    &lt;span class="n"&gt;device_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Form&lt;/span&gt;&lt;span class="p"&gt;(...),&lt;/span&gt;
    &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Form&lt;/span&gt;&lt;span class="p"&gt;(...),&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AsyncSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_session&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;grant_type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;_DEVICE_GRANT_TYPE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unsupported_grant_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_device_code_store&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_by_device_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expired_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invalid_client&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorization_pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorized&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Issue tokens through the same path as the authorization_code flow
&lt;/span&gt;        &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# one-time use
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cache-Control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no-store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pragma&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no-cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These &lt;code&gt;error&lt;/code&gt; strings are defined by RFC 8628 — don't invent your own. In particular, &lt;code&gt;authorization_pending&lt;/code&gt; means "the user just hasn't approved yet, this isn't an error, keep polling at the same interval", and any decent client library will quietly wait on it. On &lt;code&gt;access_denied&lt;/code&gt;, delete the device_code immediately — no reason to keep a rejected code alive.&lt;/p&gt;

&lt;p&gt;When authorized, issue the token through the same &lt;code&gt;TokenGenerator&lt;/code&gt; as the authorization_code flow. Device flow doesn't change what's in the token: hash the refresh token into the DB, add an id_token if the &lt;code&gt;openid&lt;/code&gt; scope is present — the normal path. Then delete the device_code to guarantee &lt;strong&gt;one-time use&lt;/strong&gt;. You can't redeem the same device_code twice.&lt;/p&gt;

&lt;p&gt;Don't forget &lt;code&gt;Cache-Control: no-store&lt;/code&gt; on the token response. A token cached by a proxy or browser is an incident waiting to happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Endpoint 3: device/verify (the human approval side)
&lt;/h2&gt;

&lt;p&gt;Called from the verification page (&lt;code&gt;/device&lt;/code&gt;). This is the one endpoint that assumes a &lt;em&gt;logged-in&lt;/em&gt; user, so &lt;code&gt;current_user&lt;/code&gt; is required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/device/verify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;device_verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DeviceVerifyRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CurrentUser&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_user_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid or expired code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This code has already been used&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorized&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the crucial split. The CLI (holding the device_code) receives the token, but &lt;em&gt;who&lt;/em&gt; the token is issued for is decided by the user logged into this browser. &lt;code&gt;store.authorize&lt;/code&gt; binds &lt;code&gt;current_user.id&lt;/code&gt; to the &lt;code&gt;user_code&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;authorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_user_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# block double-approval / expiry
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorized&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
    &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expires_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;_CODE_PREFIX&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;device_code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;status != "pending"&lt;/code&gt; check stops an already-approved or denied code from being approved again. The state machine is one-directional only: &lt;code&gt;pending → authorized&lt;/code&gt; / &lt;code&gt;pending → denied&lt;/code&gt;. Recomputing the remaining TTL and re-setting with &lt;code&gt;ex=remaining&lt;/code&gt; means approving doesn't extend the lifetime — the code still dies at the original 15-minute mark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Register it in OIDC discovery
&lt;/h2&gt;

&lt;p&gt;Finally, add &lt;code&gt;device_authorization_endpoint&lt;/code&gt; to &lt;code&gt;.well-known/openid-configuration&lt;/code&gt; so an RFC 8628-aware client library can discover the endpoint automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# well_known.py
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device_authorization_endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/oauth/device/authorize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And add &lt;code&gt;device_code&lt;/code&gt; to the client's (Codens MCP's) &lt;code&gt;grant_types&lt;/code&gt;. It only works once both server and client support it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;Device Code Flow looks niche — "authenticate a user without a browser" — but it shows up a lot: MCP, CLI tools, IoT, TV apps. The implementation points that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build the two codes by role: &lt;code&gt;device_code&lt;/code&gt; is a machine secret (high entropy), &lt;code&gt;user_code&lt;/code&gt; is human input (readable, confusable chars removed).&lt;/li&gt;
&lt;li&gt;Two Redis keys (primary + index) plus a TTL makes expiry cleanup structurally unnecessary.&lt;/li&gt;
&lt;li&gt;The state machine starts at &lt;code&gt;pending&lt;/code&gt; and is one-directional; approval happens on a separate endpoint by a logged-in user; tokens are one-time use.&lt;/li&gt;
&lt;li&gt;Follow the RFC for &lt;code&gt;error&lt;/code&gt; strings and let the server drive &lt;code&gt;interval&lt;/code&gt; / &lt;code&gt;expires_in&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anyone building a tool that connects MCP to their own service will hit "how do I log in headless" eventually. Hope this is a useful starting point.&lt;/p&gt;

&lt;p&gt;Codens builds all of this auth machinery into the product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>oauth</category>
      <category>mcp</category>
      <category>auth</category>
      <category>redis</category>
    </item>
    <item>
      <title>"Autonomous coding agents don't break in the middle, they break at the seams"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Mon, 08 Jun 2026 00:19:45 +0000</pubDate>
      <link>https://dev.to/zoetaka38/autonomous-coding-agents-dont-break-in-the-middle-they-break-at-the-seams-cb8</link>
      <guid>https://dev.to/zoetaka38/autonomous-coding-agents-dont-break-in-the-middle-they-break-at-the-seams-cb8</guid>
      <description>&lt;p&gt;After running AI coding agents in production for a while, one thing became clear: the failures aren't in the code the model writes. They're at the seams — git, CI, auth, the network. The boundaries with the outside world.&lt;/p&gt;

&lt;p&gt;The model itself is genuinely capable. It writes functions, writes tests, refactors. What breaks is everything &lt;em&gt;around&lt;/em&gt; the work: pushing the result, waiting on CI, merging the PR, refreshing a token, calling another service. And the failures are often the kind a human would avoid without thinking.&lt;/p&gt;

&lt;p&gt;Here are five incidents we hit and fixed in Codens' Purple (the orchestration core) over the last few weeks. All real, with production task IDs and dates. Every fix is merged. There's a shared design lesson at the end that ties them together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident 1: a half-resolved merge nearly flooded a PR with 12,000 lines
&lt;/h2&gt;

&lt;p&gt;This was the scary one.&lt;/p&gt;

&lt;p&gt;A Purple task on &lt;code&gt;opsguide-back&lt;/code&gt; opened a PR. I looked inside: &lt;strong&gt;+12,162 lines / 149 files changed, with literal &lt;code&gt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&lt;/code&gt; markers in 2 of them&lt;/strong&gt;. The commit graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;e567ce67 (merge commit, "chore: Fix HYBRID_SEARCH...")
 ├ parent[0] = 0b069e5d  (develop tip, +1468 commits over main)
 └ parent[1] = 2940de35  (the actual feature commit)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happened: in the fix step, the AI decided to &lt;code&gt;git merge develop&lt;/code&gt; to backport some test fixes. The merge conflicted. The AI resolved it partially and drove &lt;code&gt;git commit&lt;/code&gt; through anyway with markers still in the tree. What got pushed: develop's entire divergence plus unresolved conflict markers. If anyone had clicked merge, main would have been polluted by 1468 commits of develop drift in one shot.&lt;/p&gt;

&lt;p&gt;A human wouldn't do this. They wouldn't merge develop into a main-targeted PR in the first place, and if it conflicted they wouldn't commit until it was fully resolved. But the AI, optimizing locally to get one test passing, does it without hesitation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix: stop it at push time, in two layers
&lt;/h3&gt;

&lt;p&gt;A single &lt;code&gt;git&lt;/code&gt; &lt;code&gt;pre-push&lt;/code&gt; hook. This is where the AI's &lt;code&gt;git push&lt;/code&gt; actually goes, so this is where the guard belongs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;

&lt;span class="c"&gt;# Layer 1: conflict-marker scan (always on, no config)&lt;/span&gt;
scan_conflict_markers&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;hits
    &lt;span class="c"&gt;# Match markers at column 0 followed by a space, so we don't&lt;/span&gt;
    &lt;span class="c"&gt;# false-positive on "=======" markdown rules or ASCII art.&lt;/span&gt;
    &lt;span class="nv"&gt;hits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-lE&lt;/span&gt; &lt;span class="s1"&gt;'^(&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt; |======= |&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; )'&lt;/span&gt; HEAD 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$hits&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"purple-pre-push: ABORT — committed files contain conflict markers:"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2
        &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$hits&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/^/  /'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2
        &lt;span class="k"&gt;return &lt;/span&gt;1
    &lt;span class="k"&gt;fi
    return &lt;/span&gt;0
&lt;span class="o"&gt;}&lt;/span&gt;

scan_conflict_markers &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is scanning the &lt;strong&gt;committed tree&lt;/strong&gt; (&lt;code&gt;HEAD&lt;/code&gt;). The working directory may have been cleaned up, but markers that made it into a commit stay. &lt;code&gt;HEAD&lt;/code&gt; is what's about to be pushed, so that's what you &lt;code&gt;git grep&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The regex &lt;code&gt;^(&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt; |======= |&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; )&lt;/code&gt; matters for precision: &lt;code&gt;=======&lt;/code&gt; shows up in markdown headings and tables all the time, so we match only the exact shape of a git conflict marker — start of line, then a space.&lt;/p&gt;

&lt;p&gt;Layer 2 is a merge-source allowlist, configurable per workflow. It only runs when a policy file is present:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Layer 2: merge-source allowlist (only when a policy file exists)&lt;/span&gt;
&lt;span class="c"&gt;# {&lt;/span&gt;
&lt;span class="c"&gt;#   "feature": "feature/&amp;lt;task-id&amp;gt;",&lt;/span&gt;
&lt;span class="c"&gt;#   "base":    "main",&lt;/span&gt;
&lt;span class="c"&gt;#   "allowed": ["develop", "release/x"]&lt;/span&gt;
&lt;span class="c"&gt;# }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each new merge commit on the pushed ref, we check that every parent is reachable from &lt;code&gt;feature&lt;/code&gt; / &lt;code&gt;base&lt;/code&gt; / one of &lt;code&gt;allowed&lt;/code&gt;, using &lt;code&gt;git merge-base --is-ancestor&lt;/code&gt;. A merge from a disallowed source is rejected. Blank policy means no check — it's opt-in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;p &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$parents&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
    &lt;span class="k"&gt;for &lt;/span&gt;rname &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;refs&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
        if &lt;/span&gt;git merge-base &lt;span class="nt"&gt;--is-ancestor&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$p&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$rname&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
            &lt;/span&gt;&lt;span class="nv"&gt;ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;break
        &lt;/span&gt;&lt;span class="k"&gt;fi
    done
    if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ok&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ABORT — merge commit has parent not from allowed sources"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2
        &lt;span class="nv"&gt;bad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
    &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the unglamorous-but-important part: &lt;strong&gt;fail-safe&lt;/strong&gt;. If the hook itself has a bug and errors, the push still proceeds. A guard bug stopping every workflow is worse than the occasional incident slipping through. Layer 1 is just &lt;code&gt;git grep&lt;/code&gt; and &lt;code&gt;git log&lt;/code&gt; (tiny surface area); layer 2 falls back to permissive if &lt;code&gt;jq&lt;/code&gt; isn't available.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident 2: a transient network blip misclassified as a permanent failure
&lt;/h2&gt;

&lt;p&gt;A task routed through a self-hosted model gateway (vLLM behind Cloudflare) died with exit 1 after ~27 minutes of work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API Error: The socket connection was closed unexpectedly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gateway was healthy at session start and recovered immediately after — &lt;code&gt;GET /health&lt;/code&gt; returned 200 in 0.56s by the time I looked. So it was a momentary mid-session disconnect: the same Cloudflare-fronted overload pattern that already drives the 524 retry path, just surfacing as a closed socket from node's &lt;code&gt;fetch&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The problem: the existing retry regex covered &lt;code&gt;524 / origin_response_timeout / connection reset / Too Many Requests&lt;/code&gt; but had no entry for the closed-socket case. So the task was classified "non-transient error (exit=1), not retrying," and the whole step got escalated to Slack to wait for a human re-dispatch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix: add the patterns, and trust that false positives are cheap
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Patterns we treat as transient (safe to cleanly retry)
&lt;/span&gt;&lt;span class="n"&gt;_GW_TRANSIENT_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;524|origin_response_timeout|Too Many Requests|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;connection reset|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# added:
&lt;/span&gt;    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;socket connection was closed|socket hang up|ECONNRESET|fetch failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Added to both classifier sites: the shell-side retry loop in the per-job container, and the Python-side clean-retry detector in the workflow engine.&lt;/p&gt;

&lt;p&gt;Here's the core idea of this whole post:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Adding to the transient list is always safe. A false positive (treating a real permanent failure as transient) only wastes a 30–90s backoff. The AI is idempotent over the same prompt, so state isn't corrupted. A false &lt;em&gt;negative&lt;/em&gt; (treating a real transient as permanent) escalates to Slack and stops a human.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;False positives are cheap, false negatives are expensive.&lt;/strong&gt; So bias the classifier toward transient. This asymmetry holds for job systems in general, but it's sharper for agents: each run is tens of minutes plus inference cost, so the unit cost of a human escalation is unusually high.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident 3: merging before a late-registering CI check even appears
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;wait_ci&lt;/code&gt; built its "required checks" list from the checks observable when the PR opened.&lt;/p&gt;

&lt;p&gt;But &lt;code&gt;opsguide-back&lt;/code&gt;'s &lt;code&gt;test&lt;/code&gt; job builds a Docker image first, so it &lt;strong&gt;registers ~3 minutes after the PR opens&lt;/strong&gt;. It wasn't in the PR-open snapshot. So &lt;code&gt;wait_ci&lt;/code&gt; passed early without waiting for it, and the downstream &lt;code&gt;merge_pr&lt;/code&gt; hit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Repository rules blocked merge: 405
Required status check "test" is failing.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actual timeline (2026-05-20, opsguide-back #11284):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;04:25:25  wait_ci starts  required_checks=[check-develop-only-files,
          export-and-check, format-check, lint, check-single-head]  ← no test
04:28     test job starts
04:33     test FAILED  ← but wait_ci had already moved on to merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fix: treat branch protection, not the observed snapshot, as the source of truth
&lt;/h3&gt;

&lt;p&gt;GitHub branch protection has &lt;code&gt;required_status_checks&lt;/code&gt; — the canonical list GitHub actually gates the merge on. Read that instead of a snapshot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_required_status_check_contexts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;branch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Read branch protection's required_status_checks.
&lt;/span&gt;    &lt;span class="c1"&gt;# Return [] on 404/403 so unprotected branches / missing perms
&lt;/span&gt;    &lt;span class="c1"&gt;# fall back to current behaviour.
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Union those contexts into &lt;code&gt;wait_ci&lt;/code&gt;'s required checks with &lt;code&gt;strict=True&lt;/code&gt;. Strict mode already &lt;em&gt;waits&lt;/em&gt; for a required check that hasn't appeared yet (returns &lt;code&gt;waiting()&lt;/code&gt; when the run is None/incomplete), so the late &lt;code&gt;test&lt;/code&gt; now gets waited for, evaluated, and a failure routes to &lt;code&gt;fix&lt;/code&gt; instead of slipping through to a 405 at merge.&lt;/p&gt;

&lt;p&gt;The lesson: don't let "what I can see right now" be the system's truth. In a world where CI checks register asynchronously, an observation snapshot is always going to be stale. Read the gate definition itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident 4: the singular form of a fast check re-queued at merge time
&lt;/h2&gt;

&lt;p&gt;This one is a single regex, but the kind that eats an afternoon.&lt;/p&gt;

&lt;p&gt;Two tasks failed at merge-pr with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Repository rules blocked merge: 405
Required status check "check-branch-name" is expected.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;_is_ci_pending_error&lt;/code&gt; only matched the &lt;strong&gt;plural&lt;/strong&gt; wording &lt;code&gt;"N of M required status checks are expected"&lt;/code&gt; — i.e. &lt;code&gt;"are expected"&lt;/code&gt;. When exactly one required check is incomplete, GitHub uses the &lt;strong&gt;singular&lt;/strong&gt; &lt;code&gt;Required status check "X" is expected.&lt;/code&gt; That fell straight through the pending detector into a hard failure.&lt;/p&gt;

&lt;p&gt;Why does a check &lt;code&gt;wait_ci&lt;/code&gt; saw green re-queue at merge time? &lt;code&gt;check-branch-name&lt;/code&gt; is a fast check, and &lt;code&gt;merge_pr&lt;/code&gt; recomputes the merge base right before merging. GitHub re-evaluates branch protection against the new head and briefly reports the fast check as "expected" again until it re-reports success. The bounded retry loop was built for exactly this window — it just wasn't being entered for the single-check case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Route both singular "is expected" and plural "are expected" into retry
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# treat as CI-pending; back off and retry
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Match the bare token &lt;code&gt;"expected"&lt;/code&gt;. The only GitHub merge-block messages containing "expected" are these pending-check wordings, so widening the match can't misclassify a genuine policy rejection (required signed commits, etc.) as transient — covered by the existing regression test.&lt;/p&gt;

&lt;p&gt;Unglamorous, but singular-vs-plural and 3-minute-delays are the actual things that stop autonomous agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident 5: a borrowed token already expired the moment it arrived
&lt;/h2&gt;

&lt;p&gt;Codens' per-task workers run on &lt;em&gt;borrowed&lt;/em&gt; shared OAuth credentials. The &lt;code&gt;refreshToken&lt;/code&gt; is intentionally stripped: if it weren't, each worker's CLI would refresh independently and rotate the shared OAuth identity, cascading 401s across siblings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;src.token_refresh: token is expired — attempting refresh before job start
src.token_refresh: Token refresh failed: no refreshToken in credentials file
POST /jobs &lt;/span&gt;&lt;span class="k"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt; &lt;span class="ne"&gt;Internal Server Error&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the borrower can't refresh on its own. If the &lt;code&gt;accessToken&lt;/code&gt; it receives is already past &lt;code&gt;expiresAt&lt;/code&gt; at the moment of receipt, the worker's pre-job check dies on "no refreshToken" and &lt;code&gt;POST /jobs&lt;/code&gt; returns 500.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix: the source that holds the refreshToken refreshes before returning
&lt;/h3&gt;

&lt;p&gt;The root cause: the source credential service's &lt;code&gt;GET /claude-auth&lt;/code&gt; returned the stored credentials as-is, expiry included. The only place that holds the canonical &lt;code&gt;refreshToken&lt;/code&gt; is the source, so refresh there before returning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_claude_auth&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Refresh before returning. No-op when &amp;gt;5 min of life remains,
&lt;/span&gt;    &lt;span class="c1"&gt;# so the common case adds zero round-trips.
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;run_in_thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ensure_valid_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;load_credentials&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ensure_valid_token&lt;/code&gt; does nothing if the token has more than 5 minutes left, so it's free in the common case. Only when it's under threshold does the one place with the refreshToken (the source) refresh, write the new token, and then return it.&lt;/p&gt;

&lt;p&gt;The naive design — "the borrower refreshes" — didn't match the architectural constraint of a shared identity. Only one party &lt;em&gt;can&lt;/em&gt; refresh. So that party does it before returning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three principles that recur
&lt;/h2&gt;

&lt;p&gt;Line up all five and the fixes share a shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Autonomous agents break at the seams.&lt;/strong&gt; Not in output quality, but at the boundaries: git, CI, auth, network. So the fixes point at hardening boundaries, not at smarter models. A pre-push hook, a source of truth for CI gates, the right party to refresh a token — all classic systems design, all unrelated to the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. False positives and false negatives have asymmetric cost.&lt;/strong&gt; A misclassified transient costs a tens-of-seconds backoff; a missed one costs a human escalation. Agent runs are long and expensive, so the cost of stopping a human is unusually high. Bias classifiers toward retry. Idempotency makes that safe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Guards must be fail-safe.&lt;/strong&gt; A safety mechanism's own bug must not stop the main flow. The pre-push hook lets the push proceed if the hook itself errors unexpectedly. We weigh "every workflow stops" heavier than "an incident occasionally slips through."&lt;/p&gt;

&lt;p&gt;The more you let the AI do, the more these not-in-the-middle details pay off. Smarter models don't make git conflict markers disappear, don't make CI checks register synchronously, don't stop tokens from expiring. Keeping an agent running in production turned out to be the work of closing these seams, one at a time.&lt;/p&gt;

&lt;p&gt;Codens builds all of this into the product. Take a look if you're interested.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>ci</category>
      <category>git</category>
    </item>
    <item>
      <title>"Why we told our AI plan generator to never split tests into a separate sub-task"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Tue, 26 May 2026 08:24:33 +0000</pubDate>
      <link>https://dev.to/zoetaka38/why-we-told-our-ai-plan-generator-to-never-split-tests-into-a-separate-sub-task-2pc9</link>
      <guid>https://dev.to/zoetaka38/why-we-told-our-ai-plan-generator-to-never-split-tests-into-a-separate-sub-task-2pc9</guid>
      <description>&lt;p&gt;The run was marked failed. Two of the three sub-tasks merged cleanly. The third one, titled "Add tests for is_sent=True treated as read in test_inbox_service_unread_propagation.py", never finished. CI retried up to the cap, all failures, then gave up. The whole plan was thrown out even though two thirds of the actual code had already landed on green branches.&lt;/p&gt;

&lt;p&gt;The fix turned out to be one paragraph in one prompt. Not a code change in the dispatcher. Not a new CI flag. Just a rule that says: if a sub-task introduces or modifies code, the unit tests for that code go in the same sub-task. The "tests as their own task" pattern is forbidden.&lt;/p&gt;

&lt;p&gt;Here is what I observed, why the AI reached for the wrong decomposition, and the exact prompt rule that closed the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually happened
&lt;/h2&gt;

&lt;p&gt;Codens Purple has what I call a plan generator. That is the part of the system that takes one PRD or bug report and breaks it into sub-tasks. Each sub-task then gets dispatched on its own Git branch, runs in parallel with the others, and merges back to the base when its CI goes green. The piece of the plan generator that actually does the splitting is driven by what we internally call the analyze prompt, which is just the system prompt the model sees when it decides "how should this work be carved up."&lt;/p&gt;

&lt;p&gt;On a project called opsguide-back, for one bug, the plan generator produced this triple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Add tests for is_sent=True treated as read in
   test_inbox_service_unread_propagation.py
2. Fix _store_messages_batch in inbox_service.py to mark
   self-sent messages as read
3. Add sender_email exclusion to _build_activity_unread_count
   in resolver.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you read that as a human reviewer, it looks great. Three clean concerns, easy to review independently, no overlap in files touched. Textbook parallelization.&lt;/p&gt;

&lt;p&gt;It died anyway. Sub-tasks 2 and 3 both finished and merged. Sub-task 1, the test-only one, kept failing CI. Its branch contained only changes to the test file. The implementation functions it was asserting against did not exist on that branch yet, because the implementation lived on a sibling branch that this branch could not see. pytest collected the test, tried to import the helpers, and the asserted behaviour was simply not present. Retry, retry, retry, give up. Run failed.&lt;/p&gt;

&lt;p&gt;The cruel part is that if the merge order had happened to put the test branch last, after both impl branches had landed, the test would have passed. But we cannot guarantee that order. Each sub-task races on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the AI did this
&lt;/h2&gt;

&lt;p&gt;This was not a model failure. The model did exactly what every general-purpose decomposition heuristic would tell you to do. Split tests from implementation so they can move in parallel. That is correct advice for a human team, where the reviewer and the merge queue keep the order honest, and where a developer can rebase a test PR onto the impl PR before merging.&lt;/p&gt;

&lt;p&gt;The thing the model did not know is that our dispatch system runs each sub-task on its own isolated branch. Each sub-task sees the base branch plus its own changes, and nothing else. Sibling sub-tasks' work is invisible to it until merge time. That is not a universal fact about software development. It is a property of how we, specifically, run parallel agents. Nothing in the model's training corpus tells it that this constraint applies, because most of the corpus is about human teams.&lt;/p&gt;

&lt;p&gt;So the model reached for the most-cited decomposition pattern it knew, which happens to be wrong for our dispatcher. The mistake lived in the prompt. We had been asking the model to plan parallel work without telling it the actual rules of "parallel" in our system.&lt;/p&gt;

&lt;p&gt;This is the general shape of a lot of AI agent failures I have hit. The agent is not bad at reasoning. It is reasoning correctly in the wrong universe, because the prompt forgot to describe the universe.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;We added this block to the analyze prompt. It is the only change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## CRITICAL: Tests live with their implementation

NEVER split tests for new behaviour into a separate sub-task. Every sub-task
that introduces or modifies code MUST also add the unit tests for that code
in the SAME sub-task. The pattern "Sub-task A: implement X / Sub-task B:
add tests for X" is FORBIDDEN.

Title heuristic: if you are about to write a sub-task title that starts
with "Add tests for ..." or "Write tests for ...", STOP and merge it
into the impl sub-task whose code it tests.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things are doing the work here. The first is the explicit "FORBIDDEN" framing. The second, which I think matters more in practice, is the title heuristic. The model writes the title before it writes the body. If we can get it to catch itself at the title stage, the bad plan never gets generated in the first place, so we do not have to rely on a later pass to repair it.&lt;/p&gt;

&lt;p&gt;We also rewrote the few-shot examples in the same prompt. Before, the example impl sub-task's &lt;code&gt;## Steps&lt;/code&gt; section only listed source-code file edits. After, every example impl sub-task lists the implementation file edit and the test file edit side by side. Roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt; ## Steps
 1. Edit src/inbox_service.py: in _store_messages_batch,
    set is_read=True when message.sender_email == account_owner_email.
&lt;span class="gi"&gt;+2. Edit tests/test_inbox_service_unread_propagation.py:
+   add unit test asserting is_sent=True self-messages count
+   as read.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That tiny diff is the part that changes behaviour. Models pattern-match very strongly on few-shot examples. If every example shows tests bundled with impl, the model produces the same shape.&lt;/p&gt;

&lt;p&gt;Since the rule went in, the plan generator has stopped emitting "Add tests for ..." sub-tasks on new behaviour. The test-only failure mode is gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The exception
&lt;/h2&gt;

&lt;p&gt;There is one shape of test-only sub-task that is still fine. If we are backfilling a regression test for code that is already on the base branch, the test-only sub-task is allowed. The reason is symmetrical to the original failure: when the implementation already exists on main, a test-only branch has everything it needs to compile, import, and assert. pytest finds the function, the test runs, CI passes.&lt;/p&gt;

&lt;p&gt;The prompt calls that out explicitly so the model does not over-apply the new rule and start refusing legitimate backfill work. The line in the prompt is roughly "the rule is about new behaviour introduced in this plan, not about all test-only sub-tasks ever."&lt;/p&gt;

&lt;h2&gt;
  
  
  Generalizing
&lt;/h2&gt;

&lt;p&gt;The bigger lesson is that AI agents reach for human-team decompositions by default, and that is fine when your dispatch system also behaves like a human team. Most agent dispatch systems do not. Ours runs sub-tasks on isolated branches with no cross-visibility. Some teams run agents in long-lived shared worktrees. Some serialize. Each of these creates its own invisible constraint on what can and cannot be split.&lt;/p&gt;

&lt;p&gt;The agent does not know which one you have. It cannot infer it from the codebase, because none of those constraints are encoded in the code. They live in the dispatcher.&lt;/p&gt;

&lt;p&gt;So the work, when you start letting an agent plan parallel sub-tasks, is to spend prompt tokens drawing the line between what can be split and what cannot. For us that line was: tests for new code live with the new code. For someone else it might be: never split a migration from the code that depends on it. Or: never split a config change from the deployment that consumes it. The shape of the rule depends entirely on your dispatcher, not on the model.&lt;/p&gt;

&lt;p&gt;The pattern I would suggest is to add a single "CRITICAL" section to the planning prompt that enumerates the constraints your dispatcher imposes. Use a title-stage heuristic so the model self-rejects bad plans before generating the body. Rewrite the few-shot examples to demonstrate the right shape, because that is what the model actually copies.&lt;/p&gt;

&lt;p&gt;We rebuild Codens with Codens. Every prompt rule like this one came from watching a real run fail and adding the one sentence that would have prevented it. If you want to see how the parallel planner works end to end, the English landing page is at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>ci</category>
      <category>python</category>
    </item>
    <item>
      <title>"Why your Playwright screenshots show for Japanese / Chinese / Korean text, and the 3-line Dockerfile fix"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Mon, 25 May 2026 06:44:17 +0000</pubDate>
      <link>https://dev.to/zoetaka38/why-your-playwright-screenshots-show-for-japanese-chinese-korean-text-and-the-3-line-15pj</link>
      <guid>https://dev.to/zoetaka38/why-your-playwright-screenshots-show-for-japanese-chinese-korean-text-and-the-3-line-15pj</guid>
      <description>&lt;p&gt;I opened the screenshot artifact for our codens.ai landing page smoke test and the page was full of square boxes. Where the Japanese hero copy should have been, there was a row of □□□□□. Where the feature names were, more boxes. The nav looked like an ancient artifact from a half-decoded file.&lt;/p&gt;

&lt;p&gt;The page itself was fine. I had the dev server open in another tab and the Japanese rendered perfectly. The problem was inside the Playwright container.&lt;/p&gt;

&lt;p&gt;Three lines in the Dockerfile fixed it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;    fonts-noto-cjk \
    fonts-noto-cjk-extra \
    fonts-noto-color-emoji \
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire fix. If you only came for the answer, you can close the tab now. If you want to know why this happens and where else it will bite you, keep reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is actually happening
&lt;/h2&gt;

&lt;p&gt;The official Playwright Docker image (and most slim base images people build on) only installs Latin fonts. In our case it was &lt;code&gt;fonts-liberation&lt;/code&gt; plus &lt;code&gt;fonts-dejavu-core&lt;/code&gt;. That is enough to render English, most European languages, basic punctuation, and not much else.&lt;/p&gt;

&lt;p&gt;When Chromium tries to paint a character it has no glyph for, it does the only thing it can do. It draws the missing-glyph placeholder, which on most systems is that hollow rectangle people call a tofu box. The character code is correct. The DOM is correct. The page is correct. The screenshot rendering side just has no shape to draw.&lt;/p&gt;

&lt;p&gt;This is the part that confuses people the first time. The browser is not broken. The test is not broken. The page is not broken. The container does not have the font installed, so when the screenshot is composited there is nothing to fill the box with.&lt;/p&gt;

&lt;p&gt;You can verify this in two seconds. SSH into the container, run &lt;code&gt;fc-list | grep -i cjk&lt;/code&gt;, and you will see an empty result. That is the whole story.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;Three apt packages, added to whatever &lt;code&gt;RUN apt-get install&lt;/code&gt; block already exists in your Dockerfile.&lt;/p&gt;

&lt;p&gt;Before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    fonts-liberation &lt;span class="se"&gt;\
&lt;/span&gt;    fonts-dejavu-core &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    fonts-liberation &lt;span class="se"&gt;\
&lt;/span&gt;    fonts-dejavu-core &lt;span class="se"&gt;\
&lt;/span&gt;    fonts-noto-cjk &lt;span class="se"&gt;\
&lt;/span&gt;    fonts-noto-cjk-extra &lt;span class="se"&gt;\
&lt;/span&gt;    fonts-noto-color-emoji &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What each one buys you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fonts-noto-cjk&lt;/code&gt; is the main package. It covers Japanese kana, the Han characters used in both Japanese and Simplified Chinese, and Korean Hangul. This is the one that fixes most of the boxes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fonts-noto-cjk-extra&lt;/code&gt; covers the long tail. Traditional Chinese variants, less common Han glyphs, characters that show up in proper nouns. Worth including because the cost is small and you do not want to debug a single rare character later.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fonts-noto-color-emoji&lt;/code&gt; is the one people forget. If your page has any emoji, you will get tofu for those too. Most modern marketing pages have at least a checkmark or a sparkle somewhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Image size impact is about 70 MB on a Debian or Ubuntu base. CJK font files are large because there are tens of thousands of glyphs. If you are squeezing every megabyte you can use the smaller variable-weight subset, but for a CI image used by a test runner the 70 MB is irrelevant.&lt;/p&gt;

&lt;p&gt;I shipped this in commit 40422650 for Codens Blue, our QA agent. Rebuilt the image, reran the same smoke test, and the screenshot came out with actual readable Japanese.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why you only notice after the fact
&lt;/h2&gt;

&lt;p&gt;This is the annoying part. Nothing in your test suite tells you the screenshot is broken.&lt;/p&gt;

&lt;p&gt;Unit tests pass. The page renders correctly when a human visits it. The Playwright test reports green because the test only checks that the page loaded and the screenshot was saved. CI is happy. The artifact thumbnail in the GitHub Actions UI is tiny and you cannot tell tofu from text at that size.&lt;/p&gt;

&lt;p&gt;You notice when someone opens the screenshot to share it. A designer asks for the latest LP screenshot to compare against a Figma mock. A stakeholder pulls a screenshot for a Slack thread. A regression alert fires and you open the diff. That is when the boxes show up and someone asks why the page is full of squares.&lt;/p&gt;

&lt;p&gt;You can technically assert against tofu rendering inside the test. Sample a region that should contain CJK text, check that not every pixel in that region is identical white, fail if it looks suspiciously uniform. I have seen people do this. The implementation cost almost never beats the cost of just installing the fonts once. Three lines of Dockerfile beats a hundred lines of pixel sampling logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The same trap is everywhere
&lt;/h2&gt;

&lt;p&gt;Playwright is just the messenger. Anything that wraps a headless Chromium in a Docker container has this problem if the base image lacks CJK fonts.&lt;/p&gt;

&lt;p&gt;Puppeteer, pyppeteer, playwright-python, Selenium with headless Chrome, any custom screenshot service built on chrome-launcher, server-side rendering pipelines that use headless Chrome to generate Open Graph images. Same root cause every time. Same fix every time.&lt;/p&gt;

&lt;p&gt;If your product touches any audience outside Latin script, default to installing the CJK and emoji fonts in your base image. Treat it as part of the container setup, not as a thing you wait to hit. The cost is 70 MB and three lines. The cost of not doing it is some future Slack message that says "why is the page full of boxes" and then an afternoon of confused debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap
&lt;/h2&gt;

&lt;p&gt;That is the whole thing. Three apt packages, one rebuild, done. If you are running Codens Blue or any other screenshot-based QA flow against a multilingual page, this is the first place to look when boxes appear.&lt;/p&gt;

&lt;p&gt;If you want to see the actual landing page these screenshots are taken from, it lives at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>playwright</category>
      <category>docker</category>
      <category>i18n</category>
      <category>e2e</category>
    </item>
    <item>
      <title>"Adding Cursor Composer 2.5 as a third executor lane: 10x cheaper than Opus at comparable scores, but smoke tells a different story"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Mon, 25 May 2026 00:00:19 +0000</pubDate>
      <link>https://dev.to/zoetaka38/adding-cursor-composer-25-as-a-third-executor-lane-10x-cheaper-than-opus-at-comparable-scores-hf2</link>
      <guid>https://dev.to/zoetaka38/adding-cursor-composer-25-as-a-third-executor-lane-10x-cheaper-than-opus-at-comparable-scores-hf2</guid>
      <description>&lt;p&gt;A roughly tenfold per-task cost drop at comparable accuracy is one of those numbers you do not get to ignore for very long. Composer 2.5 published SWE-Bench Multilingual figures in the same neighborhood as Opus, and the per-attempt API cost is about an order of magnitude lower. For an agent harness that runs hundreds of attempts per project per week, a 10x cost compression on a viable lane reshapes the unit economics enough to justify a real integration, not just a spike.&lt;/p&gt;

&lt;p&gt;So I shipped Composer 2.5 as a third executor lane in Codens Purple, the orchestration service that decides which model runs each task. Codens was already running two lanes side by side: Claude via the raw Anthropic API and a self-hosted Qwen deployment. The third lane went in over two days, May 23-24, across a Phase 1 skeleton commit, a Phase 2 SDK wire, an ECS Fargate task definition change, an IAM credential isolation fix, and a one-project canary toggle.&lt;/p&gt;

&lt;p&gt;Then I ran a smoke pass. 16 failed out of 25 attempts across v4 through v17. The integration works. The benchmark numbers are not the production numbers. This is the writeup of both halves: what shipped, and what the smoke phase actually told me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a third lane at all
&lt;/h2&gt;

&lt;p&gt;The case for a third lane is the same case I made earlier this year for the per-model retry cap pattern. Each model has its own failure shape and its own cost curve. Pinning the whole harness to one provider means inheriting one bill, one rate-limit policy, and one definition of "the model got it wrong."&lt;/p&gt;

&lt;p&gt;Composer 2.5 changes the cost arithmetic in a way that matters at our retry caps. Codens retries each task per model up to a cap: claude=3, qwen=6, composer-2.5=5 for now. At cap=3 with Opus, the worst-case attempt cost dominates the per-task budget. At cap=3 with Composer 2.5 at roughly 1/10 the per-attempt rate and comparable accuracy, the worst-case attempt cost drops by roughly an order of magnitude even before factoring in higher-than-Opus first-pass success. That math is what made integration time worth spending.&lt;/p&gt;

&lt;p&gt;The optionality argument also got stronger recently. Anthropic clarified that the Agent SDK and &lt;code&gt;claude -p&lt;/code&gt; CLI workflows are not covered by subscription plans for agent use cases, which validates the API-direct path Codens already runs on. Adding a Cursor lane on top of that is the same bet, extended: do not get pinned to any one vendor's pricing or policy, and keep the harness free to route tasks to whichever lane wins on cost and reliability for the workload at hand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Executor lane design
&lt;/h2&gt;

&lt;p&gt;The pleasant part of the design was that &lt;code&gt;PurpleTask.execute_model&lt;/code&gt; already supported per-task model switching, and &lt;code&gt;PurpleProject.default_model&lt;/code&gt; already let an entire project pin a model. Adding the third lane was not an architecture change. It was an enum value plus a new runner module.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PurpleTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# existing fields elided
&lt;/span&gt;    &lt;span class="n"&gt;execute_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;opus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;composer-2.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The runner dispatcher already had two branches: &lt;code&gt;runner_claude.py&lt;/code&gt; for the Anthropic API path that wraps the &lt;code&gt;claude -p&lt;/code&gt; CLI, and &lt;code&gt;runner_qwen.py&lt;/code&gt; for the self-hosted endpoint. The third runner, &lt;code&gt;runner_cursor.py&lt;/code&gt;, slots in next to those two with the same input contract (task spec, workspace dir, env) and the same output contract (workspace diff, structured result, failure_reason on non-zero).&lt;/p&gt;

&lt;p&gt;I split the change into two commits on purpose. Phase 1 was a validation-only runner that exited non-zero on every invocation, plus the enum addition. Shippable in isolation, zero behavior change for existing tasks because nothing pointed at &lt;code&gt;composer-2.5&lt;/code&gt; yet. Phase 2 was the actual SDK call. Splitting like this means each commit can be reverted on its own, and the enum migration is not coupled to any SDK behavior question.&lt;/p&gt;

&lt;p&gt;I have learned the hard way that bundling an enum addition with the runtime that depends on it produces commits you cannot cleanly revert when the runtime turns out to be the problem. Phase 1 / Phase 2 splits are cheap insurance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: the skeleton
&lt;/h2&gt;

&lt;p&gt;Phase 1, commit &lt;code&gt;5a575031&lt;/code&gt;, did three things and nothing else. It added &lt;code&gt;composer-2.5&lt;/code&gt; to the model enum, registered &lt;code&gt;runner_cursor.py&lt;/code&gt; in the dispatch table, and made the runner validate its inputs and exit non-zero with a clear "not yet implemented" failure_reason. The migration ran on staging. The dispatch table picked up the new entry. No production task pointed at the new lane, so the runner was never invoked in the live path.&lt;/p&gt;

&lt;p&gt;This is the kind of commit that looks like it does nothing and is actually doing the most important thing: proving the surrounding plumbing is correct before the new code can hide bugs in the plumbing. If Phase 2 had landed in one shot and the SDK call had failed, I would have spent the next hour trying to figure out whether the failure was in the dispatcher, the env wiring, the IAM role, or the SDK. With Phase 1 already in production for an hour, the only thing Phase 2 could break was the SDK call itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: wiring the Cursor SDK
&lt;/h2&gt;

&lt;p&gt;Phase 2, commit &lt;code&gt;b1e7ebcd&lt;/code&gt;, is where the real work happened. The Cursor Python SDK exposes a session that walks Bridge → Client → Agent → events. The shape in the runner is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;bridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Bridge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bridge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bridge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ModelSelection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;LocalAgentOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cwd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;workspace_dir&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SendOptions&lt;/span&gt;&lt;span class="p"&gt;(...))&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;events&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;handle_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;local=LocalAgentOptions(cwd=workspace_dir)&lt;/code&gt; part matters: Cursor agents can run remotely or locally, and for Codens the workspace is already mounted into the Fargate task at a known path, so local-mode keeps the file IO inside the task and avoids round-tripping the diff over the wire. &lt;code&gt;agent.send&lt;/code&gt; returns a run handle whose &lt;code&gt;events()&lt;/code&gt; async iterator yields the structured event stream we already know how to consume from the Claude path. The translation layer in &lt;code&gt;runner_cursor.py&lt;/code&gt; normalizes Cursor's event shapes to the internal event schema that the rest of Purple already speaks.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CURSOR_API_KEY&lt;/code&gt; is the obvious blocker. We store it in AWS Secrets Manager at &lt;code&gt;purple-codens-prod/cursor-api-key&lt;/code&gt; and inject it into the per-task environment so the SDK picks it up automatically. The ECS Fargate task definition change in PR #1156 (commits &lt;code&gt;d1ef5db4&lt;/code&gt; and &lt;code&gt;656f42e4&lt;/code&gt;) exposes the secret ARN as an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CURSOR_API_KEY_SECRET_ARN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:secretsmanager:ap-northeast-1:...:secret:purple-codens-prod/cursor-api-key"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entrypoint script resolves it before launching the runner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CURSOR_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws secretsmanager get-secret-value &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--secret-id&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CURSOR_API_KEY_SECRET_ARN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query&lt;/span&gt; SecretString &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;CURSOR_API_KEY
&lt;span class="nb"&gt;exec &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; purple.runner_cursor &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This part is where I introduced a bug I want to flag specifically, because it is the kind of bug a multi-tenant SaaS should never ship. Initial commit pulled the secret using whatever &lt;code&gt;AWS_PROFILE&lt;/code&gt; was active in the task environment, which in some code paths inherited from the customer's connected AWS credentials. That is wrong in a multi-tenant harness. The fix in commit &lt;code&gt;6210a052&lt;/code&gt; makes the entrypoint use the ECS task IAM role for the Secrets Manager call, never the customer's profile. Customer credentials are scoped to customer resources only. Platform credentials, including our Cursor API key, must resolve through the task role. Easy mistake, important fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  The canary procedure
&lt;/h2&gt;

&lt;p&gt;I do not trust new lanes in production until a real project has run on them for at least a day. The canary procedure (commit &lt;code&gt;d6fe3cb3&lt;/code&gt;) is intentionally small: flip &lt;code&gt;purple_projects.default_model = 'composer-2.5'&lt;/code&gt; on exactly one internal Corevice-org project, dogfood it, and watch the metrics. Every other project stays on whatever model they were already on, which means the canary is fully isolated.&lt;/p&gt;

&lt;p&gt;The SQL is one row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;purple_projects&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;default_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'composer-2.5'&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;internal-project-id&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rollback is the same statement with the prior value. No code deploy involved. This is one of the upsides of keeping model selection as runtime data rather than baking it into deploy artifacts: rollback is a transaction, not a release.&lt;/p&gt;

&lt;p&gt;The comparison axes we track on the canary versus the same project's last 30 days on Opus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Completion rate (task finishes without exhausting retries)&lt;/li&gt;
&lt;li&gt;Verify pass rate (Codens verify steps succeed against the final diff)&lt;/li&gt;
&lt;li&gt;Wall time per task&lt;/li&gt;
&lt;li&gt;Cost per completed task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point of the canary is not to certify the lane is good. The point is to surface the failure modes that benchmarks do not surface, before any real customer touches the new lane.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the smoke runs actually showed
&lt;/h2&gt;

&lt;p&gt;Across v4 through v17, the smoke pass ran 25 attempts on the canary project. Nine finished. Sixteen failed. That is a 36% completion rate on a workload where the equivalent Opus runs were sitting around 80%+. The benchmark numbers and the production numbers were not the same numbers.&lt;/p&gt;

&lt;p&gt;Two failure modes accounted for almost all of the misses.&lt;/p&gt;

&lt;p&gt;The Cursor SDK bridge dropped mid-session on a handful of long-running tasks. When the bridge dropped, the workspace diff in progress was lost, the run handle errored, and the runner reported a generic SDK exception. Salvaging the partial diff at the moment the bridge dropped was the obvious fix. Commit &lt;code&gt;0f95f020&lt;/code&gt; catches the bridge-drop exception, snapshots whatever is currently on disk in the workspace, and feeds that diff into the retry attempt's context so the next attempt does not start from zero.&lt;/p&gt;

&lt;p&gt;The other failure mode was uglier. When a task exhausted its retry cap, the runner reported &lt;code&gt;failure_reason = "exceeded max executions (5)"&lt;/code&gt; and that was it. The operator on the other side had no visibility into why each of those five attempts had failed. The fix in the same commit (&lt;code&gt;0f95f020&lt;/code&gt;) enriches &lt;code&gt;failure_reason&lt;/code&gt; with the last attempt's actual error string. Now when the cap is exhausted, the operator sees &lt;code&gt;"exceeded max executions (5): last attempt failed with: &amp;lt;real error&amp;gt;"&lt;/code&gt; and can route the task to a different lane or escalate.&lt;/p&gt;

&lt;p&gt;Two smaller fixes shipped alongside. Commit &lt;code&gt;1be0614f&lt;/code&gt; surfaces the AWS CLI failure when the Secrets Manager call fails. Previously the entrypoint swallowed it silently and the runner started with an empty &lt;code&gt;CURSOR_API_KEY&lt;/code&gt;, producing an opaque 401 from the SDK three seconds later. Now the entrypoint exits non-zero with the AWS CLI error before the runner even starts. Commit &lt;code&gt;64af2b50&lt;/code&gt; cleans up the per-task env injection and drops a &lt;code&gt;message&lt;/code&gt; field collision between the Cursor event schema and our internal one that was causing some events to lose their payload during translation.&lt;/p&gt;

&lt;p&gt;None of these fixes turn Composer 2.5 into a production-grade lane for our workload. They turn it into a lane I can operate, observe, and reason about while we keep iterating on it. The canary stays canary. Customer-facing projects stay on the lanes they were on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Multi-lane executor architecture is a hedge, and like all hedges, the value shows up only when you actually need it. Composer 2.5 may or may not become a default-routing lane for Codens in the coming weeks. The 10x cost compression is real, the benchmark numbers are real, and the smoke phase is also real. The point of the canary procedure is that we get to find out which of those three numbers matters for our workload before any customer feels it.&lt;/p&gt;

&lt;p&gt;The integration cost was a Phase 1 skeleton, a Phase 2 SDK wire, an ECS task definition change, an IAM fix, and a one-row SQL toggle. The integration value, regardless of whether Composer 2.5 sticks, is one more lane the harness can route through next time a pricing announcement or a model release reshapes the cost curve. That optionality is what an AI dev harness is supposed to give you.&lt;/p&gt;

&lt;p&gt;Codens is at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt; if you want to see what a multi-lane harness for autonomous code repair and QA looks like in production.&lt;/p&gt;

</description>
      <category>cursor</category>
      <category>ai</category>
      <category>python</category>
      <category>aws</category>
    </item>
    <item>
      <title>"Centralizing billing across 5 products triggered a 403 nobody saw coming"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Sat, 23 May 2026 10:44:56 +0000</pubDate>
      <link>https://dev.to/zoetaka38/centralizing-billing-across-5-products-triggered-a-403-nobody-saw-coming-32ae</link>
      <guid>https://dev.to/zoetaka38/centralizing-billing-across-5-products-triggered-a-403-nobody-saw-coming-32ae</guid>
      <description>&lt;p&gt;We flipped &lt;code&gt;USE_BCP=true&lt;/code&gt; on Red at 14:02. The first 403 hit Sentry at 14:06. By 14:11 the pattern was clear: any user who tried to do something that touched org-level credit (granting a teammate access, viewing the org credit balance, kicking off a fix run under an org-scoped project) got a 403 back from the Red API, which had received a 403 from BCP, which had received a "not a member" from Auth.&lt;/p&gt;

&lt;p&gt;Staging didn't catch it. I want to be honest about that part before anything else. Staging had two users in one org, both of which had been provisioned by me through the Auth admin path months ago, so their org memberships existed in Auth's &lt;code&gt;org_members&lt;/code&gt; table by accident of history. Every code path I exercised in staging happened to read from a row that was already there. The bug only fires when a user accepts an org invitation on the product side after the cutover, and we had no synthetic flow for that in staging. Lesson noted, expensive way to learn it.&lt;/p&gt;

&lt;p&gt;This post is about what actually broke, why the design wasn't wrong (the implementation was missing), and the three branches I considered for where org-membership authority should live before settling on the one that produced the bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase H: why centralize billing now
&lt;/h2&gt;

&lt;p&gt;Codens is five products plus two platform services. Red does auto-fix, Blue does QA, Green does PRDs, Yellow is the engineering activity ledger, Purple is the orchestration layer. Auth is the identity service. BCP, the Billing Control Plane, is the newest piece and the subject of this story.&lt;/p&gt;

&lt;p&gt;Until last quarter, each product calculated its own credit consumption. That was fine when Red was the only product taking money. It became untenable around the time Green went into beta, because we had three different rounding rules, two slightly different definitions of "what counts as a billable run," and a support ticket pattern that boiled down to "my org's credit balance on Red doesn't match my org's credit balance on Blue and you charged me twice." Phase H of the architecture roadmap pulls all of that into BCP. Every product reads its credit policy from BCP, posts consumption events to BCP, and asks BCP "can this user/org afford this operation?" before starting work.&lt;/p&gt;

&lt;p&gt;The cutover is gated behind two env vars per product:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;USE_BCP&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;BCP_API_URL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;https://api.billing.codens.ai&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I cut one product at a time, starting with Red because it has the highest traffic and the most mature billing surface. Red PR #266 was the actual flip. Blue PR #233 and Green PR #411 followed once Red had been stable for a week. Yellow and Purple are scheduled for next quarter, both still on local credit math.&lt;/p&gt;

&lt;p&gt;The cutover order matters for this story because the 403 only manifests on org-scoped operations. Red individual-account billing kept working perfectly. So did Blue and Green individual accounts. It was specifically the org-shared credit pool path that exploded, and only for users who had joined their org through the product-side invitation flow rather than through Auth's admin console.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracing the 403
&lt;/h2&gt;

&lt;p&gt;The first instinct was "BCP is misconfigured." It wasn't. BCP logs showed clean inbound requests with the right org_id, the right user_id, the right requested operation. BCP then made an internal call to Auth: "is user X a member of org Y?" Auth returned false. BCP returned 403. Red returned 403. User saw 403.&lt;/p&gt;

&lt;p&gt;The Auth log line was the clarifying one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /internal/orgs/{org_id}/members/{user_id} -&amp;gt; 404
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So Auth wasn't broken either. Auth was correctly reporting that user X was not a member of org Y, as far as Auth knew. I pulled the user out of the database. The user existed in Auth's &lt;code&gt;users&lt;/code&gt; table. The org existed in Auth's &lt;code&gt;organizations&lt;/code&gt; table. The link row in Auth's &lt;code&gt;org_members&lt;/code&gt; was missing.&lt;/p&gt;

&lt;p&gt;I went over to Red's database. The link row was there. Red had a row that said user X belonged to org Y, with the role and joined-at timestamp from the day the user accepted the invitation. Red had been authoritative for this relationship the entire time.&lt;/p&gt;

&lt;p&gt;CDTSK-1392 captured the root cause. Auth Codens is supposed to be master of organizations and memberships, but each product had grown its own &lt;code&gt;organizations&lt;/code&gt; and &lt;code&gt;org_members&lt;/code&gt; tables back when each product was a standalone service. Invitation acceptance was handled locally by each product. The row landed in the product's database, and nobody told Auth. Pre-BCP, this didn't matter, because the product was the one authorizing org-scoped operations against its own tables. Post-BCP, BCP asks Auth, Auth doesn't know, 403.&lt;/p&gt;

&lt;p&gt;The bug is not in the centralization. The bug is that we shipped centralization assuming a sync that didn't exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three branches for where authority lives
&lt;/h2&gt;

&lt;p&gt;Before writing the sync, I had to decide whether the sync was even the right answer. There are three reasonable places to put authority over org membership in a multi-product setup like ours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authority in the auth service.&lt;/strong&gt; Auth is the master record. Every product holds a local cache (or a foreign-key shadow) and reflects changes back to Auth as they happen. This is what we have. It's the most conventional choice. The downside is the one we just discovered: every product-side write path that affects membership has to remember to call Auth, and forgetting is silent until something else (like BCP) starts depending on Auth being correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authority in billing itself.&lt;/strong&gt; BCP owns the org and member tables. Every product reads from BCP. This has the appeal of "the system that needs to know the truth owns the truth." It also means every product becomes hard-dependent on BCP being up to render a user's basic org context, which is a much bigger blast radius than billing being temporarily degraded. I didn't want every Red dashboard render to fail because BCP was deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authority distributed across products.&lt;/strong&gt; Each product remains the source of truth for memberships that originate in that product. BCP, when asked to authorize an org-scoped operation, routes the membership question to whichever product owns the org. This sounds clever for two products. With five products, the routing table is a permanent piece of infrastructure that has to be updated every time a new product launches, and the question "who owns this org" is itself a piece of state that has to live somewhere central. You've reinvented the auth service, badly.&lt;/p&gt;

&lt;p&gt;I chose branch one. The 403 wasn't evidence of a wrong choice. It was evidence that I'd shipped half of a choice. The half I shipped (BCP queries Auth) was correct. The half I hadn't shipped (products tell Auth about new memberships) was the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The sync endpoint
&lt;/h2&gt;

&lt;p&gt;The fix has two halves. Auth needs an endpoint that products can call. Products need to call it at the right moments.&lt;/p&gt;

&lt;p&gt;On the Auth side, I added &lt;code&gt;POST /api/v1/internal/organizations/{org_id}/members:upsert&lt;/code&gt;. The verb is &lt;code&gt;upsert&lt;/code&gt; deliberately. The endpoint is idempotent and the products call it both on invitation acceptance and on role changes, so the handler has to be willing to create or update without the caller knowing which case applies. The response status differentiates: 201 if a new membership row was created, 200 if an existing row was updated.&lt;/p&gt;

&lt;p&gt;Getting FastAPI to actually return 201 vs 200 from the same handler was the part that almost shipped broken. PR #124 was the fix. The original handler looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/organizations/{org_id}/members:upsert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;UpsertOrgMemberResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upsert_org_member&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UpsertOrgMemberRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_case&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UpsertOrgMemberUseCase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_upsert_use_case&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;UpsertOrgMemberResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;use_case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;UpsertOrgMemberResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you annotate the return as a Pydantic model, FastAPI takes over status code resolution and forces the default for the route (200 for POST in our config, or 201 if you set &lt;code&gt;status_code=&lt;/code&gt; on the decorator). Either way you can't branch. You get one status for both the create and the update case, which silently broke the idempotency contract for any caller that wanted to distinguish.&lt;/p&gt;

&lt;p&gt;The fix is to return &lt;code&gt;JSONResponse&lt;/code&gt; directly so the handler controls the status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/organizations/{org_id}/members:upsert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upsert_org_member&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UpsertOrgMemberRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_case&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UpsertOrgMemberUseCase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_upsert_use_case&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;use_case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;UpsertOrgMemberResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You lose automatic OpenAPI response model inference, which is a real cost. You get correct semantics, which is a bigger gain. I document the response shape with &lt;code&gt;responses={200: ..., 201: ...}&lt;/code&gt; on the decorator to keep the OpenAPI spec honest.&lt;/p&gt;

&lt;p&gt;On the product side, Red PR #264 added the client call at the two moments membership state changes: invitation acceptance and role update.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;accept_invitation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invitation_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;invitation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invitations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invitation_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org_members&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;invitation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;invitation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert_org_member&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;invitation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;invitation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invitations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mark_accepted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invitation_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Auth call is not in a transaction with the local write, which is a deliberate choice and a place where I might be wrong. If the local write succeeds and the Auth call fails, we have drift. The current mitigation is a nightly reconciliation job that compares product &lt;code&gt;org_members&lt;/code&gt; to Auth &lt;code&gt;org_members&lt;/code&gt; and re-upserts anything missing. I'd rather drift and reconcile than block invitation acceptance on Auth being reachable.&lt;/p&gt;

&lt;p&gt;Blue and Green shipped matching calls in their respective PRs.&lt;/p&gt;

&lt;p&gt;Side cleanup: while I was in BCP I noticed that the bonus-credit endpoint silently dropped its grant when the &lt;code&gt;grant_type&lt;/code&gt; field name on the wire didn't match what the receiver expected (the sender was using &lt;code&gt;bonus_type&lt;/code&gt;, the receiver was reading &lt;code&gt;grant_type&lt;/code&gt;, Pydantic accepted the payload with &lt;code&gt;extra="ignore"&lt;/code&gt; and quietly inserted a row with the default grant type). PR #265 fixed the Red caller and PR #231 fixed Blue. Lesson there is to not use &lt;code&gt;extra="ignore"&lt;/code&gt; on internal wire models, but that's another post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;The biggest one is that staging only catches the bugs you have data for. The org-membership row was present in staging by historical accident, so the path that read it worked. I now provision a fresh, end-to-end test user (sign up, accept invitation, perform org-scoped action) as part of pre-cutover validation, scripted, not "remember to do it."&lt;/p&gt;

&lt;p&gt;Cutting one product at a time was the only thing that kept the blast radius survivable. If I had flipped all three on the same morning the triage would have taken twice as long, because every signal would have been duplicated three ways. The order Red, then Blue, then Green wasn't load-balanced for anything clever — it was just the order I trusted the metrics on.&lt;/p&gt;

&lt;p&gt;Naming the endpoint &lt;code&gt;:upsert&lt;/code&gt; instead of overloading &lt;code&gt;POST .../members&lt;/code&gt; mattered more than I expected. When the FastAPI status code issue came up, the conversation was "the upsert endpoint should return different codes for create vs update," which is a one-sentence problem statement. If the endpoint had been &lt;code&gt;POST /members&lt;/code&gt; I'd have spent another hour arguing about whether 200 or 201 was correct in the abstract.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap
&lt;/h2&gt;

&lt;p&gt;The hardest part of centralizing anything across a product family is not the new service. The new service is straightforward, you write it, you deploy it, you wire up clients. The hard part is figuring out who is allowed to be the source of truth for the relationships the new service depends on, and then making every existing write path honor that choice. We chose Auth as the master for org membership, which I still think is right. We just hadn't enforced it everywhere it mattered, and BCP was the first dependent that actually cared.&lt;/p&gt;

&lt;p&gt;If you want to see how the rest of the harness fits together, the English landing page is at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt;. Yellow and Purple come onto BCP next quarter. I'll write that one up too, hopefully without the same shape of bug.&lt;/p&gt;

</description>
      <category>multitenant</category>
      <category>saas</category>
      <category>fastapi</category>
      <category>billing</category>
    </item>
    <item>
      <title>"When the AI gets stuck, the engineer fetches the same PRD via MCP and keeps going"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Wed, 20 May 2026 07:33:54 +0000</pubDate>
      <link>https://dev.to/zoetaka38/when-the-ai-gets-stuck-the-engineer-fetches-the-same-prd-via-mcp-and-keeps-going-52nd</link>
      <guid>https://dev.to/zoetaka38/when-the-ai-gets-stuck-the-engineer-fetches-the-same-prd-via-mcp-and-keeps-going-52nd</guid>
      <description>&lt;p&gt;Last Tuesday I watched our auto-fix agent burn through three retries on a session-handling bug and surrender. The failure mode was honest. It tried, the diff broke a test we did not know existed, it tried again, the second diff fought with an old idempotency check, the third diff was basically the first one with renamed variables. Then it stopped. The bug report sat in our system marked &lt;code&gt;analysis_failed&lt;/code&gt;, the proposed plan was there, the partial diff was there, and the engineer who had to take over was sitting in Slack scrolling.&lt;/p&gt;

&lt;p&gt;That gap, the moment between "AI gave up" and "engineer is coding," is where most AI dev tools quietly cost more than they save. The engineer cannot just resume. They have to reconstruct what the AI was looking at: which PRD section, which kickoff decision, which root cause analysis, which files the bug report pointed at. The data exists. It just lives in five places and none of them are inside the IDE.&lt;/p&gt;

&lt;p&gt;We shipped &lt;code&gt;codens-mcp&lt;/code&gt; v0.7.5 partly to close that gap. The AI workflow inside Codens reads and writes the same PRDs, bug reports, kickoffs, and run logs that an engineer can now pull into Claude Code over MCP with one call. Same source of truth. Two surfaces. The handoff loses nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 80/20 reality nobody markets
&lt;/h2&gt;

&lt;p&gt;The honest number for a well-tuned AI dev harness on real production code is somewhere between 80% and 90% of tasks completed end-to-end. The rest is novel business logic, conflicts with code the AI never saw, spec ambiguity that no amount of retry will resolve, and the long tail of edge cases that someone has to think through. I do not believe the "100% AI development" pitch and I do not think anyone shipping into real codebases does either.&lt;/p&gt;

&lt;p&gt;The 20% is not the problem. The problem is the seam between the 80% and the 20%.&lt;/p&gt;

&lt;p&gt;When the AI hands a task back, the human arrives without context. The PRD is in Notion. The bug analysis is in Sentry plus some chat thread. The kickoff decision that explains "we chose JWT not session cookies" is buried in a meeting recap. The engineer has to play archaeologist before they write a single line. And because the AI workflow has already burned through three retries, the next attempt starts from a worse position than if the engineer had been the first responder.&lt;/p&gt;

&lt;p&gt;Most AI dev tools optimize the 80%. They get better at the part the AI was already good at. The 20% gets a "human-in-the-loop" label and a button that says "request review." That button does not solve anything. The engineer still has to find everything.&lt;/p&gt;

&lt;p&gt;Codens treats the seam as the actual product. The 80% has to keep getting better, obviously. But the 20% is where the trust gets built or destroyed, and the only way to make it good is to make the takeover instantaneous.&lt;/p&gt;

&lt;h2&gt;
  
  
  One source of truth, two read paths
&lt;/h2&gt;

&lt;p&gt;Every artifact the AI produces or consumes during a task is a first-class entity in Codens, stored in Postgres, owned by a project, scoped to an org. Green Codens owns the planning side: Consultation (the requirement-gathering conversation), PRD (the structured spec), Kickoff (the implementation plan with vision, scope, tech selection, milestones), Plan (the task breakdown). Red Codens owns the repair side: Bug Report (with the AI's root cause analysis attached), Bug Fix Plan (proposed impact scope and test requirements). Purple Codens owns execution: Run (the live event stream from a workflow), Logs.&lt;/p&gt;

&lt;p&gt;The AI workflow writes to these entities through internal service calls. When the Green PRD AI generator finishes a section, it patches the PRD row. When Red's analyzer finishes, it attaches an analysis blob to the bug report. When Purple's runner emits an event, it goes to the run's event log. Nothing escapes into chat. Nothing depends on a human copying text from one tab to another.&lt;/p&gt;

&lt;p&gt;The second read path is &lt;code&gt;codens-mcp&lt;/code&gt;. It is a Python package that registers as an MCP server inside Claude Code (or any other MCP client). It authenticates with the same JWT the web app uses, talks to the same backend APIs that the AI workflow talks to, and exposes 38 tools that cover 137+ actions. When an engineer calls &lt;code&gt;green_prd(action="get", prd_id=...)&lt;/code&gt;, they get the same PRD bytes the AI agent read three retries ago.&lt;/p&gt;

&lt;p&gt;The point is not "we have an API." Every product has an API. The point is that the AI workflow and the engineer use the same access shape against the same row. There is no "engineer-facing version" of the PRD that drifts from the "AI-facing version." There is one row. Both sides read it. Both sides can write it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What codens-mcp actually exposes
&lt;/h2&gt;

&lt;p&gt;The retrieval surface that matters for a takeover is small. An engineer who arrives at a failed task needs to know: what was being built, what decisions were already made, what the AI tried, and where it broke.&lt;/p&gt;

&lt;p&gt;Install and authenticate once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;codens-mcp
codens-mcp login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;login&lt;/code&gt; runs Device Code Flow against the Codens auth service and stores a JWT at &lt;code&gt;~/.purple-codens/credentials.json&lt;/code&gt;. From that point every tool call carries the token automatically.&lt;/p&gt;

&lt;p&gt;Register the server in Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"codens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"codens-mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"serve"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the engineer, in their IDE, asks Claude to pull the bug report the AI was working on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;red_bug_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;organization_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org_abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bug_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bug_2f8a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; { id, title, description, severity, steps_to_reproduce,
#      expected_behavior, actual_behavior, affected_files,
#      analysis: { root_cause, evidence, suspected_files }, ... }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;action&lt;/code&gt; parameter pattern is the whole reason 38 tools cover 137+ operations. One &lt;code&gt;green_prd&lt;/code&gt; tool handles create, list, get, update, delete, update_section, approve, submit_for_review, request_changes, archive, unarchive, link_notion, unlink_notion, and consistency-check. The tool descriptor that the model loads at startup is one short signature, not fifteen. (We have written separately about why that matters for context budget — the short version is that a five-server stack burns 55K tokens advertising itself before any work; &lt;code&gt;codens-mcp&lt;/code&gt; burns under 5K for everything.)&lt;/p&gt;

&lt;p&gt;For a takeover the engineer typically chains two or three calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;green_kickoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kickoff_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kck_7a1c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; vision, scope, non-goals, tech selection, milestones
&lt;/span&gt;
&lt;span class="nf"&gt;green_plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plan_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pln_91de&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; ordered task list with status and dependencies
&lt;/span&gt;
&lt;span class="nf"&gt;purple_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_be40&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; last events, failure reason, partial outputs
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three calls. Maybe forty seconds. The engineer now has the same view of the work that the AI had when it gave up, without leaving the IDE and without reading a single Slack thread.&lt;/p&gt;

&lt;h2&gt;
  
  
  Walking through a real takeover
&lt;/h2&gt;

&lt;p&gt;The Tuesday session-handling bug. Here is what actually happened after the third retry failed.&lt;/p&gt;

&lt;p&gt;The on-call engineer opened their IDE. Claude Code was already running with &lt;code&gt;codens-mcp&lt;/code&gt; registered. They typed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Pull bug report &lt;code&gt;bug_2f8a&lt;/code&gt; and the latest fix plan."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude called &lt;code&gt;red_bug_report(action="get", bug_id="bug_2f8a")&lt;/code&gt; and &lt;code&gt;red_bug_fix_plan(action="get_by_bug", bug_id="bug_2f8a")&lt;/code&gt; in parallel. Both returned in under a second. The analysis pointed at the auth middleware. The fix plan listed the three files the AI thought needed to change and the test it expected to pass. The engineer read it in maybe two minutes.&lt;/p&gt;

&lt;p&gt;Then they asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What did the last Purple run actually do?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude called &lt;code&gt;purple_run(action="get_status", run_id=...)&lt;/code&gt; and &lt;code&gt;purple_run(action="subscribe_events", run_id=...)&lt;/code&gt; for replay. The event log showed exactly which test had failed on each retry and why the third retry had effectively reverted to the first. The AI had been bouncing between two incompatible local minima.&lt;/p&gt;

&lt;p&gt;That was the engineer's "aha." The fix plan was conceptually right, but the test the AI was retrying against was wrong, written by an earlier feature, asserting a behavior the new spec explicitly changed. The engineer fixed the test, applied the AI's second-attempt diff with a four-line manual adjustment, and shipped it. From bug report open to PR merged: 23 minutes, including reading.&lt;/p&gt;

&lt;p&gt;Without &lt;code&gt;codens-mcp&lt;/code&gt; that same takeover would have been: open Sentry, search by ticket, copy stack trace, open Notion, find the PRD by title, scroll to the right section, open the chat thread where the kickoff lived, find the test naming pattern, grep the repo, then start coding. I have timed that path on myself. It is between 25 and 45 minutes before the first edit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tradeoff
&lt;/h2&gt;

&lt;p&gt;The price of "one source of truth, two read paths" is schema discipline. Every artifact has to be modeled well enough that the AI workflow and the engineer both find what they need in it. You cannot let the PRD turn into a Markdown blob with five conflicting section conventions, because the AI's &lt;code&gt;update_section&lt;/code&gt; action and the engineer's &lt;code&gt;get_section&lt;/code&gt; reader both depend on the structure being honest. You cannot let the bug report become a free-text field with the root cause analysis stuffed at the bottom in a different format every time, because the takeover tooling that highlights &lt;code&gt;analysis.suspected_files&lt;/code&gt; will silently miss them.&lt;/p&gt;

&lt;p&gt;This is heavier upfront than the alternative, which is to let each side render its own view. The alternative loses every time. The drift between "what the PM thinks the spec says" and "what the engineer thinks the spec says" is, in my experience, the single biggest source of bugs in features that get partially built by an AI. The schema discipline pays for itself the first time a takeover succeeds in under thirty minutes.&lt;/p&gt;

&lt;p&gt;The other cost is honest: we run on the Anthropic API direct path, with per-token billing and our own multi-model routing across Claude and Qwen. That gives us control over the escalation path (AI workflow to engineer manual takeover via MCP) independent of what any single platform decides about subscription-tier agent access. When the platform shifts, the takeover path does not move.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap
&lt;/h2&gt;

&lt;p&gt;Graceful degradation is the unappreciated half of AI dev tool design. Anyone can build an agent that succeeds on the easy 80%. The teams that ship into real production code earn their trust on the 20% where the agent gives up and a human takes over. The only way to make that takeover not feel like a downgrade is to make the data the human needs be exactly the data the agent had, in the same shape, one tool call away.&lt;/p&gt;

&lt;p&gt;That is what &lt;code&gt;codens-mcp&lt;/code&gt; is. The AI does most of the work. When it cannot, the engineer reads the same row.&lt;/p&gt;

&lt;p&gt;Codens English landing: &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt;&lt;br&gt;
&lt;code&gt;codens-mcp&lt;/code&gt; on PyPI: &lt;a href="https://pypi.org/project/codens-mcp/" rel="noopener noreferrer"&gt;https://pypi.org/project/codens-mcp/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>claude</category>
      <category>architecture</category>
    </item>
    <item>
      <title>"One JWT, five services, and the python-jose audience list trap"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Sat, 16 May 2026 04:34:53 +0000</pubDate>
      <link>https://dev.to/zoetaka38/one-jwt-five-services-and-the-python-jose-audience-list-trap-5e3i</link>
      <guid>https://dev.to/zoetaka38/one-jwt-five-services-and-the-python-jose-audience-list-trap-5e3i</guid>
      <description>&lt;p&gt;&lt;code&gt;audience must be a string or None&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That was the exception python-jose threw the moment our unified MCP server tried to talk to the second backend behind it. The token was valid. The signature checked out. The claims were correct. The library just refused to accept a list as the expected audience, and the JWT spec disagrees with the library on whether that should be a problem.&lt;/p&gt;

&lt;p&gt;We run a single MCP server, &lt;code&gt;codens-mcp&lt;/code&gt; on PyPI, that fronts five backends: Red (auto-fix), Blue (QA), Green (PRD), Purple (orchestration), and Auth. One MCP token, five destinations. When Claude calls a Red tool, the MCP server proxies an HTTP request to the Red backend carrying that same token. Same for Blue, Green, Purple, Auth. Each backend has its own primary audience for its own user-facing tokens, and we wanted all of them to also accept the MCP server's token without minting five service-specific JWTs per session.&lt;/p&gt;

&lt;p&gt;This is the story of how that ran into a python-jose quirk, and the 12-line workaround we ended up shipping.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture, briefly
&lt;/h2&gt;

&lt;p&gt;Codens exposes 31 tools across the five product surfaces through one MCP server. From Claude's side it is a single connection. From the backends' side, each one sees a normal authenticated HTTP request with a bearer token in the header. The token is issued by the Auth service. Its &lt;code&gt;aud&lt;/code&gt; claim is &lt;code&gt;purple-codens-mcp&lt;/code&gt;, because the MCP server is the thing the user logged into when they connected their client.&lt;/p&gt;

&lt;p&gt;Each backend already had its own audience for its first-party tokens. Green expects &lt;code&gt;green-codens&lt;/code&gt;. Red expects &lt;code&gt;red-codens&lt;/code&gt;. And so on. Those audiences were baked into the OAuth verifier and matched the audience claim on tokens minted by that service's own login flow.&lt;/p&gt;

&lt;p&gt;We had two ways forward.&lt;/p&gt;

&lt;p&gt;The first option: mint five tokens per MCP session. The MCP server logs into Red, Green, Blue, Purple, and Auth as the user, gets five JWTs, and selects the right one based on which tool the user invoked. This is conceptually clean. It also means five times the token issuance, five rotation surfaces, five sets of refresh flows to coordinate, and a routing layer in the MCP server that has to know which token belongs to which tool. None of that adds value.&lt;/p&gt;

&lt;p&gt;The second option: mint one token, declare its audience as &lt;code&gt;purple-codens-mcp&lt;/code&gt;, and teach every backend to accept that audience in addition to its own primary one. The MCP server holds one credential. Each backend keeps its primary audience for its own native flows and additionally trusts MCP-issued tokens. Rotation surface stays small. The routing logic in the MCP server disappears.&lt;/p&gt;

&lt;p&gt;We picked option two. The plan was to add a per-service config that lists additional accepted audiences, expand the verifier to check against the union, and ship it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix v1: pass a list to python-jose
&lt;/h2&gt;

&lt;p&gt;The setting looked like this in every backend service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseSettings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_AUDIENCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;green-codens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_ADDITIONAL_AUDIENCES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purple-codens-mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The verifier change looked equally innocuous. python-jose's &lt;code&gt;jwt.decode&lt;/code&gt; accepts an &lt;code&gt;audience&lt;/code&gt; keyword. The naive reading of every JWT tutorial on the internet says you give it the expected audience and it checks the token's &lt;code&gt;aud&lt;/code&gt; against that. So we built a list of accepted audiences and handed it over:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;audiences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verify_audience&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audience&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;audiences&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OAUTH_ADDITIONAL_AUDIENCES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;audiences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OAUTH_ADDITIONAL_AUDIENCES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;algorithms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;algorithm&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audiences&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;audiences&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the version we wrote, ran a quick local smoke test against, and pushed to the dev environment thinking the work was done. The shape of the change matched the shape of the problem. A list of allowed audiences in, an &lt;code&gt;aud&lt;/code&gt; claim checked against that list, request accepted. Done.&lt;/p&gt;

&lt;p&gt;The dev environment, of course, immediately disagreed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap
&lt;/h2&gt;

&lt;p&gt;The MCP server made its first call into Green and the request came back as a 401. The Green logs had the actual exception underneath the generic auth failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TypeError: audience must be a string or None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;python-jose's &lt;code&gt;jwt.decode&lt;/code&gt; does not accept a list for its &lt;code&gt;audience&lt;/code&gt; parameter. If you pass one, it raises before it even looks at the token. The library has only ever supported single-string audience verification. There is no flag, no overload, no helper that takes a list.&lt;/p&gt;

&lt;p&gt;RFC 7519 is unambiguous on the other side of this question. Section 4.1.3 defines &lt;code&gt;aud&lt;/code&gt; as either a single case-sensitive string or an array of case-sensitive strings, and verification logic is supposed to check that the recipient identifies itself with at least one of the values present. The spec assumes set membership semantics on both ends. The token can have multiple audiences, and the verifier can accept multiple audiences. Whether either side is a list is a transport detail.&lt;/p&gt;

&lt;p&gt;python-jose is one of the most-used Python JWT libraries. Most FastAPI tutorials reach for it without thinking. It is also old, and the maintainer activity is thin. There is a multi-year-old GitHub issue tracking exactly this limitation, with patches floating around in forks and pull requests that never merged. The library's behavior is what it is, and if you need list audience verification, you are on your own.&lt;/p&gt;

&lt;p&gt;The honest read here is that the JWT spec describes capability and most libraries describe a comfortable subset of it. The subset is usually fine. The moment you do anything cross-service it stops being fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix v2: decode without audience verification, then verify manually
&lt;/h2&gt;

&lt;p&gt;The fix that worked is to use python-jose for what it is good at, which is signature verification and claim decoding, and do the audience check ourselves. python-jose lets you disable individual claim checks through its &lt;code&gt;options&lt;/code&gt; dict. &lt;code&gt;verify_aud: False&lt;/code&gt; turns off the built-in audience verification entirely. The signature, expiry, issuer, and everything else still get checked. We just take responsibility for &lt;code&gt;aud&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;should_verify_aud&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;verify_audience&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;algorithms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;algorithm&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verify_aud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;should_verify_aud&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;allowed_audiences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OAUTH_ADDITIONAL_AUDIENCES&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;token_aud&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;token_aud_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_aud&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_aud&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token_aud&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token_aud&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_aud_set&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;allowed_audiences&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;InvalidTokenError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid audience: token aud=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token_aud&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt;, expected one of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allowed_audiences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The set intersection does the entire job. &lt;code&gt;token_aud_set &amp;amp; allowed_audiences&lt;/code&gt; returns a set of values present in both, and if that set is empty the token is for someone else and we reject it. If the token's &lt;code&gt;aud&lt;/code&gt; is a single string we wrap it in a one-element set. If it is a list we convert directly. If it is missing we get an empty set and the intersection is empty, which fails closed.&lt;/p&gt;

&lt;p&gt;One subtle thing about the order. We compute &lt;code&gt;should_verify_aud&lt;/code&gt; before calling &lt;code&gt;jwt.decode&lt;/code&gt;, not after, because we want the variable to capture the caller's intent independent of what python-jose returns. If someone passes &lt;code&gt;verify_audience=False&lt;/code&gt;, we skip the manual check entirely. If they pass &lt;code&gt;verify_audience=True&lt;/code&gt; but the service has no configured audience, there is nothing to verify against, so we also skip. The manual block only runs when there is something real to check.&lt;/p&gt;

&lt;p&gt;The error message includes both the token's actual &lt;code&gt;aud&lt;/code&gt; value and the sorted list of audiences we accept. When you debug an inter-service auth failure at 2am, the only thing worse than a 401 with no detail is a 401 that tells you nothing about the mismatch. The cost of formatting that message into the exception is zero and the time it saves is real.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bonus pattern: decode and verify as separate steps
&lt;/h2&gt;

&lt;p&gt;Once you have done this once, decoupling decoding from verification starts to feel like the right default for any JWT code that has to do anything non-trivial. The library is good at parsing the structure and confirming the signature. Your service is the one that knows which claims matter and what acceptance looks like.&lt;/p&gt;

&lt;p&gt;The same pattern handles a bunch of adjacent problems. Token introspection for audit logs without re-running all the checks. Soft expiry where you log a warning at 90 percent of the lifetime instead of rejecting. Migration windows where you accept tokens signed with either the old or new key for a week. Custom claim validation that the library has never heard of. Whenever a future library bug lands in the issuer check or the expiry math, you have an escape hatch already in place because the verification logic is yours.&lt;/p&gt;

&lt;p&gt;This is also the answer even if python-jose ships list audience support tomorrow. You do not lose anything by owning the audience check. You gain a place to put the next requirement that does not fit cleanly into a kwarg.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap
&lt;/h2&gt;

&lt;p&gt;Multi-service authentication keeps running into the gap between what JWT can do and what the convenient libraries actually do. The spec is generous. The libraries are opinionated. When you stitch services together, the opinions usually have to give.&lt;/p&gt;

&lt;p&gt;The unified-token path was worth the workaround. One JWT, one rotation, one issuer, five backends that each know how to accept it. The cost was a dozen lines of manual verification in a shared OAuth module. We would make the same trade again.&lt;/p&gt;

&lt;p&gt;If you want to see how Codens uses this on the agent side, the English landing page is at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt;. The MCP server is &lt;code&gt;codens-mcp&lt;/code&gt; on PyPI and it is what the agent connects to when it needs to talk to any of the five product surfaces.&lt;/p&gt;

</description>
      <category>jwt</category>
      <category>python</category>
      <category>fastapi</category>
      <category>auth</category>
    </item>
    <item>
      <title>"Claude 3, Qwen 6: why we set a different fix_verify retry cap per model"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Fri, 15 May 2026 07:58:45 +0000</pubDate>
      <link>https://dev.to/zoetaka38/claude-3-qwen-6-why-we-set-a-different-fixverify-retry-cap-per-model-oce</link>
      <guid>https://dev.to/zoetaka38/claude-3-qwen-6-why-we-set-a-different-fixverify-retry-cap-per-model-oce</guid>
      <description>&lt;p&gt;Claude gets 3 retries. Qwen gets 6. Everything else gets 5.&lt;/p&gt;

&lt;p&gt;That is the default &lt;code&gt;fix_verify_retry_cap&lt;/code&gt; in Codens Purple right now, after a few weeks of staring at fix-rate curves per model. It started as one global cap, the same number for every model the workflow could route to. We changed it once we had enough production data to see that the same number was both too high for one model and too low for another at the same time.&lt;/p&gt;

&lt;p&gt;This is the story of the split, what the loop actually does, and the few lines of code that put the policy in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix_verify loop
&lt;/h2&gt;

&lt;p&gt;Codens Purple runs an agent that proposes a code fix, then verifies it by running a test or a check, then decides whether to retry with feedback from the verification step. The loop looks roughly like this. Generate a candidate change, apply it, run the verify command, read the result. If verify passes, the loop is done. If verify fails, feed the failure output back into the next prompt and try again. Each retry is a new API call. Each API call costs per-token credits, and verify itself costs wall clock time plus whatever the test suite costs to run.&lt;/p&gt;

&lt;p&gt;The retry cap is the integer that says how many of those iterations the loop is allowed before it gives up and surfaces the partial result to the user. A cap of 1 means one attempt, no retry. A cap of 3 means an initial attempt plus two retries. A cap of 6 means up to six attempts total.&lt;/p&gt;

&lt;p&gt;The cap matters because the curve of "fix succeeds at attempt N" is not flat. It is heavily front-loaded. Most successful fixes succeed on attempt 1 or 2. The question for any given model is how long the long tail is, and how much of that tail is worth paying for.&lt;/p&gt;

&lt;p&gt;When we had one cap for all models, that one number had to be a compromise. The compromise was bad in two directions at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we got to multi-model
&lt;/h2&gt;

&lt;p&gt;Codens started with Claude as the only model. Specifically, Claude via the Anthropic API, using a raw API key with per-token billing. Not the subscription, not the bundled tier. We are a multi-tenant product running thousands of small &lt;code&gt;fix_verify&lt;/code&gt; cycles per day across many customers, and a subscription does not cleanly support that shape of workload. Per-token billing lets us scale spend with usage and attribute cost back to the project that incurred it.&lt;/p&gt;

&lt;p&gt;This came up again recently when Anthropic announced that the &lt;code&gt;claude -p&lt;/code&gt; print mode, the Agent SDK, and CI use cases now require an API plan rather than a subscription. For us this was a non-event. We were already on the API. The announcement just confirmed that the path we picked is the path Anthropic wants production agent workloads to take.&lt;/p&gt;

&lt;p&gt;Claude is excellent for &lt;code&gt;fix_verify&lt;/code&gt;. The per-attempt success rate is high and the failure modes are usually informative, meaning when it does not fix the bug on attempt 1, the diff it produces and the verify output together give the next attempt a real signal. The downside is cost. At scale, with thousands of fix loops a day, the per-token bill is a real line item.&lt;/p&gt;

&lt;p&gt;A few months in, we started evaluating Qwen as a secondary model to drive cost down on a subset of tasks. Qwen runs on our own infrastructure on AWS EC2 hosts, which gives us per-token cost well below the Anthropic API for the same task size. The tradeoff was the reliability profile. Per-attempt success rate is lower than Claude. Failure modes are noisier. Some of the time the model will produce a syntactically valid but semantically wrong patch, and the verify step is the only thing that catches it.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of model where retries earn their keep. Qwen's curve of cumulative success vs attempt number rises more slowly than Claude's, but it keeps rising further out. Attempt 5 is still adding meaningful success rate. With Claude, attempt 5 is mostly wasted credits on a fundamentally wrong understanding that more retries are not going to fix.&lt;/p&gt;

&lt;p&gt;So we had two models in production with different shapes of success curve, and we were applying the same retry cap to both. Something had to give.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why one cap did not work
&lt;/h2&gt;

&lt;p&gt;Suppose we set the global cap to 3, tuned for Claude. Claude is fine. Qwen leaves real success on the table, because attempts 4, 5, and 6 would have converted a measurable fraction of failures into passes, and now they do not happen. Fix rate drops on Qwen-routed tasks. Users notice. They route more work to Claude, which is the opposite of what we wanted from introducing Qwen.&lt;/p&gt;

&lt;p&gt;Suppose we set the global cap to 6, tuned for Qwen. Qwen is fine. Claude wastes credits. Attempts 4, 5, and 6 on a Claude-routed task that has already failed three times have a low chance of succeeding, because Claude's failure mode at attempt 3 is usually "I do not understand the bug" or "the test I am running is checking something I cannot see," and the same prompt with the same verify output is not going to flip that on attempt 6. We were paying full Sonnet-tier per-token cost for those attempts.&lt;/p&gt;

&lt;p&gt;The compromise we ran for a while was a cap of 5 globally. It was bad on both axes. Claude wasted 2 attempts worth of credits on its failure cases. Qwen left 1 attempt worth of success on the floor. We could see this in the data once we started bucketing the loop outcome by model and attempt number. The right answer was clearly per-model, not global.&lt;/p&gt;

&lt;h2&gt;
  
  
  The per-model defaults
&lt;/h2&gt;

&lt;p&gt;The implementation is small. We added a nullable integer column on the project table, &lt;code&gt;fix_verify_retry_cap&lt;/code&gt;, with NULL meaning "use the model-based default." A helper function returns the default for a given model name. The use case layer combines the two when it kicks off a loop.&lt;/p&gt;

&lt;p&gt;The helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_default_fix_verify_cap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The schema field, on the project update payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PurpleProjectUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;fix_verify_retry_cap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;le&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Alembic migration adds the column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purple_projects&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fix_verify_retry_cap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the use case resolves the effective cap when it starts a task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;effective_cap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fix_verify_retry_cap&lt;/span&gt;
    &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_default_fix_verify_cap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;execute_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The override range is 1 to 20. One on the low end because some projects have run a single attempt followed by a human review, and we do not want to break that pattern. Twenty on the high end because it is a reasonable ceiling for a customer who wants to push the long tail of a cheap self-hosted model further than our default. If they set 20 and burn through it, that is their cost. We log the effective cap on every task so it shows up in the project audit log alongside the outcome.&lt;/p&gt;

&lt;p&gt;The defaults of 3, 5, 6 are not magic numbers pulled out of intuition. We picked them by plotting cumulative fix rate against attempt number for each model from a few weeks of production runs and looking at where the curve flattens. For Claude, the curve is essentially flat past attempt 3. For Qwen, it is still meaningfully rising at 5 and starts to flatten at 6. For other models we had less data, so 5 is the safe middle.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tradeoff
&lt;/h2&gt;

&lt;p&gt;The honest cost of this change is that adding a new model to the routing layer is no longer free. Before, we added a model and it inherited the global cap. Now we have to pick a default. If we do not pick one, the model falls through to the 5 default, which is usually fine but not always optimal.&lt;/p&gt;

&lt;p&gt;In practice, this turned into a small ritual when introducing a new model. Route a small fraction of traffic to it at cap 8 or 10 for a week, plot the curve, find the elbow, set the default to one or two above the elbow. The ritual takes a few hours of analysis on top of the model integration itself. We considered automating it, computing the default from rolling fix rates per model on a cadence. We have not built that yet. The set of models we route to is small enough that a manual review every couple of months is fine. If the set grew to ten or more, automation would start to pay back.&lt;/p&gt;

&lt;p&gt;The other tradeoff is that the policy is now opinionated in a way users can feel. If a customer on a Claude-routed project reports "fix gave up too early," the answer is sometimes "the default cap is 3, raise it to 5 on your project and try again." That is a real conversation we have had. It is the price of a default that is right on average but not for every codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the cap is, really
&lt;/h2&gt;

&lt;p&gt;A retry cap is a budget. Specifically, it is a budget that integrates two things at once. The marginal probability of success at each attempt. The marginal cost of each attempt. The optimal cap is the largest N where the expected value of attempt N is still positive, which means attempt N's marginal success times the value of a fix exceeds attempt N's marginal cost in credits and verify time. That number is per-model because both factors are per-model.&lt;/p&gt;

&lt;p&gt;When we set 3 for Claude and 6 for Qwen, we are saying the integral converges faster on Claude because high per-attempt success runs out of incremental room quickly, and converges slower on Qwen because lower per-attempt success keeps adding incremental room for longer at a much lower per-attempt cost. The split is what makes a multi-model workflow economically coherent.&lt;/p&gt;

&lt;p&gt;If you are running anything like this loop in production, do not pick one number for all your models. Plot the curve. The number falls out.&lt;/p&gt;

&lt;p&gt;Codens Purple is part of the harness at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt; . The retry cap split lives in &lt;code&gt;purple-codens&lt;/code&gt; under the project use case layer.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>python</category>
      <category>anthropic</category>
    </item>
    <item>
      <title>"When 'Control request timeout: initialize' actually means SIGKILL: Claude Code CLI OOM inside Celery"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Thu, 14 May 2026 00:08:43 +0000</pubDate>
      <link>https://dev.to/zoetaka38/when-control-request-timeout-initialize-actually-means-sigkill-claude-code-cli-oom-inside-n0o</link>
      <guid>https://dev.to/zoetaka38/when-control-request-timeout-initialize-actually-means-sigkill-claude-code-cli-oom-inside-n0o</guid>
      <description>&lt;p&gt;A production Celery task in Codens Green started returning this, intermittently, only under real load:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Control request timeout: initialize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The string is suspiciously specific. It looks like the kind of message you would see if Claude Code CLI's MCP initialization handshake had timed out on the other side of a pipe. That is what it sounds like. That is not what it was.&lt;/p&gt;

&lt;p&gt;The task is &lt;code&gt;analyze_code_specification&lt;/code&gt;. It spawns Claude Code CLI as a subprocess to analyze a repository against a PRD. It worked in staging, worked locally, worked in CI. It failed in production a few times a day, almost always when more than one analysis was running at the same time.&lt;/p&gt;

&lt;p&gt;What we eventually shipped: route that task to a dedicated Celery queue, run that queue on a separate ECS Fargate worker tier with 8 GB of memory, pin concurrency to 1. The real bug was the Linux kernel OOM killer terminating Claude Code CLI partway through startup, before it could complete its handshake with the parent task. The misleading log line was just what survives when a child process is shot in the head mid-init.&lt;/p&gt;

&lt;p&gt;This is the chase.&lt;/p&gt;

&lt;h2&gt;
  
  
  The wrong paths
&lt;/h2&gt;

&lt;p&gt;I spent the better part of a day inside Claude Code CLI's initialization code path, because that is where the error string lived.&lt;/p&gt;

&lt;p&gt;First theory: stdio buffering. The CLI talks to the parent over stdin/stdout. If the parent is not reading fast enough, the child can block on a full pipe and look like it is hanging. I added explicit buffer drains, raised the timeout, switched to line-buffered mode on both sides. The error still happened.&lt;/p&gt;

&lt;p&gt;Second theory: MCP protocol version mismatch. Maybe a recent Claude Code update changed the init handshake and our version pin was stale. I diffed the changelog, compared protocol versions across our deployed image and a known-good local environment. They matched.&lt;/p&gt;

&lt;p&gt;Third theory: a bug in the agent SDK config. We pass a lot of options into the CLI. Maybe one of them was triggering a slow path during init that exceeded the handshake budget. I trimmed the config down to the smallest reproducible set, then to nothing. Same error in production. Still nothing in staging.&lt;/p&gt;

&lt;p&gt;Fourth theory, the one I am least proud of: maybe Claude Code itself has an upstream init bug under concurrent load. I drafted half of a GitHub issue before I noticed I had no actual evidence and was just frustrated.&lt;/p&gt;

&lt;p&gt;None of these held up. The fingerprint of the failure, intermittent, only under load, only in production, did not match any of them. Buffering bugs are deterministic. Protocol mismatches are deterministic. Config bugs are deterministic. This was load-correlated. That is a different shape of problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The exit code
&lt;/h2&gt;

&lt;p&gt;The thing that finally cracked it was looking at the subprocess exit code instead of the log message. We were capturing the error string before we captured &lt;code&gt;returncode&lt;/code&gt;, and the error string was so plausible it had crowded out the rest of the diagnostic surface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;proc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_subprocess_exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stderr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;communicate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude code failed rc=%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The value coming out was &lt;code&gt;-9&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On POSIX, when &lt;code&gt;subprocess&lt;/code&gt; reports a negative return code, the absolute value is the signal that killed the child. Signal 9 is SIGKILL. SIGKILL cannot be caught, cannot be handled, cannot be cleaned up after. The process is removed from the run queue. There is exactly one common source of SIGKILL on Linux that arrives without a parent or operator sending it on purpose: the kernel OOM killer.&lt;/p&gt;

&lt;p&gt;That was the moment. This is no longer a Claude Code problem. This is an OS-level problem. The CLI had not timed out during initialization. The CLI had been shot during initialization, by the kernel, for using too much memory.&lt;/p&gt;

&lt;p&gt;The "Control request timeout: initialize" message was a downstream symptom. The parent task was waiting for the child to finish its handshake. The child was killed mid-handshake. The parent eventually gave up waiting and surfaced the most specific thing it knew, which was that init had not completed in time. The error was technically true and completely misleading.&lt;/p&gt;

&lt;h2&gt;
  
  
  OOM math
&lt;/h2&gt;

&lt;p&gt;Once you know the shape, the math is easy.&lt;/p&gt;

&lt;p&gt;Claude Code CLI is not a small process. It boots a JavaScript runtime, loads the agent SDK, hydrates context, and prepares for tool calls. In our workload, resident memory per invocation sits between roughly 500 MB and 1.5 GB, peaking higher during initial context load.&lt;/p&gt;

&lt;p&gt;Our Celery worker pool was the general-purpose one. Sized for the rest of our tasks, which are normal Python work: webhook fan-out, database writes, small HTTP calls. Those tasks live happily in well under 200 MB each. The worker host had memory headroom appropriate to that profile, with default Celery concurrency, which spins up multiple worker processes per host so several tasks run in parallel.&lt;/p&gt;

&lt;p&gt;That is fine for normal traffic. It is not fine when two of those parallel tasks each decide to spawn a 1+ GB CLI subprocess.&lt;/p&gt;

&lt;p&gt;Picture the failure mode. Two PRDs are submitted within the same minute. Two Celery workers pick up &lt;code&gt;analyze_code_specification&lt;/code&gt;. Each launches Claude Code CLI. Both CLIs start allocating. The host's resident memory climbs past its limit. The kernel's OOM killer wakes up and picks a victim, typically the largest recent allocator. Claude Code CLI dies with SIGKILL. The Celery task surfaces "Control request timeout: initialize" because that is what it saw from its end of the pipe. The other task may or may not also die, depending on timing.&lt;/p&gt;

&lt;p&gt;The reason this never showed up in staging was simple: staging has one user, me, running one job at a time. Concurrency was always 1 by accident. The bug needed two simultaneous invocations on the same host to express itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix, in four parts
&lt;/h2&gt;

&lt;p&gt;I did not want to over-engineer this. The fix is structurally small. It is mostly Celery routing and infra sizing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Dedicated queue.&lt;/strong&gt; &lt;code&gt;analyze_code_specification&lt;/code&gt; got its own queue, separated from everything else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# celery_app.py
&lt;/span&gt;&lt;span class="n"&gt;task_routes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks.analyze_code_specification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks.run_fix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fixing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks.control_plane.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;control_plane&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks.plan_monitor.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan_monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# everything else falls through to "default"
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point of the queue split is not load balancing. It is so we can attach a different worker profile to this task without changing anything about the others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dedicated ECS Fargate worker tier.&lt;/strong&gt; The &lt;code&gt;analysis&lt;/code&gt; queue gets its own worker service, on its own Fargate task definition, with 8 GB of memory. The rest of the workers stay on the smaller general-purpose host. One service, one queue, one process shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Concurrency = 1.&lt;/strong&gt; The worker for the &lt;code&gt;analysis&lt;/code&gt; queue starts like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;celery &lt;span class="nt"&gt;-A&lt;/span&gt; app worker &lt;span class="nt"&gt;-Q&lt;/span&gt; analysis &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 1 &lt;span class="nt"&gt;--loglevel&lt;/span&gt; info
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the load-bearing piece. Even on an 8 GB host, if you let two CLI invocations run in parallel, you can still blow past the limit when both peak at 1.5 GB at the same time and the OS plus worker plus everything else has its own footprint. Concurrency 1 means exactly one Claude Code CLI subprocess exists on this host at any time. Two analyses come in, the second one queues, waits, runs next. Slower, totally fine, never OOMs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Memory headroom.&lt;/strong&gt; 1 CLI × roughly 1.5 GB peak × concurrency 1, against 8 GB total, with the worker process and OS taking a few hundred MB. That gives more than 5 GB of headroom for a worst-case CLI invocation. If we ever needed to raise concurrency to 2, we would also need to either double the instance size or accept the OOM risk back. We chose not to.&lt;/p&gt;

&lt;p&gt;We also added regression tests at the routing layer, asserting that &lt;code&gt;analyze_code_specification&lt;/code&gt; resolves to the &lt;code&gt;analysis&lt;/code&gt; queue, that control-plane tasks do not accidentally get rerouted there, and that plan-monitor isolation is preserved. The routing dict is the kind of thing that quietly bit-rots in a PR review, and a misroute would silently bring the bug back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;The dedicated worker tier is more expensive per task than just bumping the general worker's RAM. It scales slower under burst load because the queue depth gates throughput. It is one more service to deploy, monitor, alert on, and update during a Claude Code CLI version bump. None of that is free.&lt;/p&gt;

&lt;p&gt;What we got in return is that this failure mode cannot happen anymore for any reason that is not "we accidentally raised concurrency above 1." That is a single config line in one repo with a test guarding it. I will take that tradeoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  What generalizes
&lt;/h2&gt;

&lt;p&gt;Two things stuck with me after this.&lt;/p&gt;

&lt;p&gt;One: when a child process surfaces a plausible-sounding error during a handshake, check &lt;code&gt;returncode&lt;/code&gt; before you check the message. A negative return code on POSIX is a different category of failure from anything the application itself can report. A negative number is the OS telling you the application never got a chance.&lt;/p&gt;

&lt;p&gt;Two: per-task memory profiles matter for Celery worker sizing in a way that defaults do not protect you from. A worker pool tuned for 200 MB tasks will silently kill a 1.5 GB task and tell you something else happened. If your task spawns a subprocess that is heavier than your worker, the right answer is almost always a separate queue with its own concurrency and its own host, not a bigger general-purpose host.&lt;/p&gt;

&lt;p&gt;We build Codens, an AI dev harness with this kind of analysis baked in. &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>celery</category>
      <category>python</category>
      <category>debugging</category>
    </item>
    <item>
      <title>"Cutting MCP token bloat by 12x: what happened when we packed 31 tools into one server"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Tue, 12 May 2026 02:49:09 +0000</pubDate>
      <link>https://dev.to/zoetaka38/cutting-mcp-token-bloat-by-12x-what-happened-when-we-packed-31-tools-into-one-server-4149</link>
      <guid>https://dev.to/zoetaka38/cutting-mcp-token-bloat-by-12x-what-happened-when-we-packed-31-tools-into-one-server-4149</guid>
      <description>&lt;p&gt;Earlier this week &lt;a href="https://twitter.com/akshay_pachaar" rel="noopener noreferrer"&gt;@akshay_pachaar&lt;/a&gt; summarized a year of MCP-vs-CLI arguing into one sharp line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The MCP vs CLI debate. For most of 2025, AI Engineers argued about it. The skeptics had real numbers: Playwright MCP eats 13.7K tokens, Chrome DevTools MCP eats 18K. A 5-server setup burns 55K tokens before any work."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He is right. Those numbers are the steady drumbeat against MCP as a delivery format. If your agent burns 55K tokens just advertising capabilities, the protocol starts to look like a tax.&lt;/p&gt;

&lt;p&gt;We just shipped a counter-data point. &lt;code&gt;codens-mcp&lt;/code&gt; is a single Python package that exposes 31 tools across five products (Purple, Red, Blue, Green, Auth, plus a cross-product registration tool). I sat down with &lt;code&gt;wc -c&lt;/code&gt; and a calculator and got a number I had to triple-check: the entire tool surface, descriptions and all, is ~4,720 tokens. That is roughly 12x less than the 5-server number in the tweet, and about 3x less than Playwright MCP alone.&lt;/p&gt;

&lt;p&gt;This is not a "look how clever we are" post. It is the boring engineering answer: most of MCP's token cost is not the protocol, it is the loading strategy. Below I walk through how we measured it, the five architecture decisions that made the number small, and the real tradeoffs we ate to get there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The measurement
&lt;/h2&gt;

&lt;p&gt;Here is the actual byte count from the tool definition files, straight off disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auth_tools.py     1,555 chars
blue_tools.py     2,576 chars
cross_tools.py    3,913 chars
green_tools.py    6,160 chars
purple_tools.py   1,448 chars   # re-exports 16 tools from purple-codens-mcp
red_tools.py      3,231 chars
                 ───────
total            18,883 chars  ≈ 4,720 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 4 chars/token heuristic is a known underestimate for natural-language English (3.5 is closer to GPT/Claude tokenizers in practice), but it is fine as an upper-bound on a registration payload that contains a mix of Python identifiers, docstrings, and JSON-schema-ish hints. The MCP server sends a slightly inflated version of these definitions over the wire as tool descriptors, so the on-context cost the model sees is in the same order of magnitude. I have done the apples-to-apples comparison with &lt;code&gt;tiktoken&lt;/code&gt; on the rendered descriptors and the number lands between 4.4K and 5.1K depending on whether you count the JSON schema framing. ~4,720 is the honest middle.&lt;/p&gt;

&lt;p&gt;The 31 tools break down like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purple (16, re-exported from &lt;code&gt;purple-codens-mcp&lt;/code&gt;): &lt;code&gt;purple_login&lt;/code&gt;, &lt;code&gt;purple_whoami&lt;/code&gt;, &lt;code&gt;purple_analyze_repo&lt;/code&gt;, &lt;code&gt;purple_register_project&lt;/code&gt;, and twelve more covering projects, repos, instructions, workflows, and SSE.&lt;/li&gt;
&lt;li&gt;Red (4): &lt;code&gt;red_create_bug_report&lt;/code&gt;, &lt;code&gt;red_get_bug_report&lt;/code&gt;, &lt;code&gt;red_analyze_bug_report&lt;/code&gt;, &lt;code&gt;red_submit_bug_fix_plan_to_purple&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Blue (4): &lt;code&gt;blue_list_e2e_tests&lt;/code&gt;, &lt;code&gt;blue_generate_e2e_test&lt;/code&gt;, &lt;code&gt;blue_run_e2e_test&lt;/code&gt;, &lt;code&gt;blue_get_e2e_test_results&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Green (4): &lt;code&gt;green_create_consultation_with_message&lt;/code&gt;, &lt;code&gt;green_send_consultation_message&lt;/code&gt;, &lt;code&gt;green_convert_consultation_to_prd&lt;/code&gt;, &lt;code&gt;green_create_kickoff&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Auth (2): &lt;code&gt;auth_agent_signup&lt;/code&gt;, &lt;code&gt;auth_get_pricing&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Cross (1): &lt;code&gt;codens_register_project_unified&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where this lands against the public reference points:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;Approx. tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Playwright MCP&lt;/td&gt;
&lt;td&gt;many&lt;/td&gt;
&lt;td&gt;13,700&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chrome DevTools MCP&lt;/td&gt;
&lt;td&gt;many&lt;/td&gt;
&lt;td&gt;18,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5-server stack (mixed)&lt;/td&gt;
&lt;td&gt;varies&lt;/td&gt;
&lt;td&gt;~55,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;codens-mcp&lt;/code&gt; (unified)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~4,720&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If we had shipped five separate MCPs, one per product, even at a conservative per-server registration overhead the stack would have cost ~65K tokens of context before any tool ran. We did not, and that is the whole story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why one package works
&lt;/h2&gt;

&lt;p&gt;Five decisions did the work. None of them are clever. All of them are boring tradeoffs that happen to compound.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Prefix namespacing instead of MCP-server-level scoping
&lt;/h3&gt;

&lt;p&gt;Every tool carries its product prefix in the name. The flat namespace makes the file you saw above legal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;purple_login, purple_whoami, purple_analyze_repo, ...
red_create_bug_report, red_analyze_bug_report, ...
blue_generate_e2e_test, blue_run_e2e_test, ...
green_convert_consultation_to_prd, ...
auth_agent_signup, auth_get_pricing
codens_register_project_unified
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We pay verbosity in the tool name. We get zero collision risk and one MCP process. I considered nested groupings (&lt;code&gt;codens.red.create_bug_report&lt;/code&gt; style), but flat names render cleaner in tool-use traces and grep better in logs. Worth it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Shared client code
&lt;/h3&gt;

&lt;p&gt;All five product clients live in one place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/codens_mcp/client/
  auth.py
  blue.py
  green.py
  red.py
  auth_helper.py    # JWT load/refresh, shared
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the part that does not show up in the token count but matters for the maintenance story. Five separate MCP packages would mean five copies of &lt;code&gt;auth_helper.py&lt;/code&gt; drifting independently. One package means one bug fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Single auth flow
&lt;/h3&gt;

&lt;p&gt;Auth Codens is the SSO root for the family, so the MCP server only ever speaks one login dialect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codens-mcp login        &lt;span class="c"&gt;# Device Code Flow, runs once&lt;/span&gt;
&lt;span class="c"&gt;# token persisted to ~/.purple-codens/credentials.json&lt;/span&gt;
&lt;span class="c"&gt;# every product client reads the same file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The historical path is &lt;code&gt;~/.purple-codens/credentials.json&lt;/code&gt; because Purple shipped first and we did not want to break existing users by renaming. Cosmetic debt, zero functional cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Re-export pattern for Purple
&lt;/h3&gt;

&lt;p&gt;This is the move that kept us honest. Purple already had a standalone MCP package on PyPI (&lt;code&gt;purple-codens-mcp&lt;/code&gt;) before the unified server existed. We did not fork it. The unified package imports and re-registers Purple's tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/codens_mcp/tools/purple_tools.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;purple_codens_mcp.tools.project_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register_project_tools&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;purple_codens_mcp.tools.repo_tools&lt;/span&gt;    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register_repo_tools&lt;/span&gt;
&lt;span class="c1"&gt;# ...four more imports
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;register_purple_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;_register_purple_auth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_purple_get_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;_register_projects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_purple_get_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;_register_repos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_purple_get_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Existing users of &lt;code&gt;purple-codens-mcp&lt;/code&gt; on PyPI keep working unchanged. &lt;code&gt;codens-mcp&lt;/code&gt; adds Red, Blue, Green, Auth, and Cross on top. One package can be fully replaced by the other without breaking anyone, which gave us a safe rollout.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Lazy execution
&lt;/h3&gt;

&lt;p&gt;The 4,720 tokens is the registration cost. Claude Code sees all 31 tool descriptors at startup. Each tool's actual HTTP call only fires on invocation, and the per-call response is bounded by the tool's own prompt (usually a few hundred tokens of JSON). The thing that scales linearly with use is the conversation transcript, not the registration. Bloat at startup is the lever; we pulled it once, and the rest of the session is unaffected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest tradeoffs
&lt;/h2&gt;

&lt;p&gt;Unified is not free. Three things we gave up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One process is one failure mode.&lt;/strong&gt; If &lt;code&gt;codens-mcp&lt;/code&gt; crashes, all five product surfaces are gone simultaneously. With separate MCPs each product gets its own isolation boundary and a Red bug cannot take down Green tooling. We accepted this because we are a small shop, the package is small, and a crash in production would tell us we have a much bigger problem than tool routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update cadence is coupled.&lt;/strong&gt; Shipping a new Red tool means cutting a new version of the whole package. Users get every product's churn whether they wanted it or not. We considered semver-per-product subnamespacing and rejected it because our internal release cadence is already weekly and roughly synchronized; the imaginary user who wants Red on a daily cycle but Green frozen does not exist for us yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permission boundary is coarse at the MCP layer.&lt;/strong&gt; Authenticating once gives the user access to all 31 tools. You cannot tell Claude Code "allow Red but not Green" through the MCP descriptors alone. We solved this one level up: Auth Codens enforces role-based permissions on the server side, so even if the MCP exposes &lt;code&gt;green_create_kickoff&lt;/code&gt;, the API call rejects users who do not have the Green entitlement. The MCP becomes the surface; the gate lives elsewhere.&lt;/p&gt;

&lt;p&gt;"Unified is always right" is not the conclusion here. If you ship one MCP per oncall team and the teams release on different cycles, you are paying the token tax for a reason, and the isolation buys you something real. The unified shape worked for us because the products were already coupled.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the token bloat actually comes from
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://twitter.com/akshay_pachaar" rel="noopener noreferrer"&gt;Akshay's follow-up tweet&lt;/a&gt; closes the loop:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The protocol was never the bottleneck. The loading strategy was."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the line I want every MCP author to internalize. The 55K-token figure is not what MCP-the-spec costs. It is what N separate handshakes plus N capability advertisements plus N redundant client preambles cost when you let your tools sprawl into N independent servers.&lt;/p&gt;

&lt;p&gt;Look at the math from the other direction. If five separate MCPs each carry a 10–15K registration footprint (one server's worth of capability JSON, instructions, schema bundles), you are at 50–75K before the model has done anything useful. Collapse the five servers to one and the registration overhead collapses too, because there is only one capability list, one instruction blob, one schema bundle, and the per-tool descriptor cost is small.&lt;/p&gt;

&lt;p&gt;The protocol is doing its job. The protocol is also fine with you stacking five copies of itself in your config file, because that is a user choice, not a spec smell. Treating MCP servers like microservices ("one per product, for isolation") is the analogue of running 30 Lambda cold starts where one process would do.&lt;/p&gt;

&lt;p&gt;We did not invent a new transport. We did not strip schemas. We just stopped paying for five handshakes when one would do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle
&lt;/h2&gt;

&lt;p&gt;Partition your MCP surface by domain, not by tool class. If five tools share an auth root, a release cadence, and a user mental model, they belong in one server. If they do not, split. The token cost is a downstream signal of how well that partition matches reality.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;codens-mcp&lt;/code&gt; is on PyPI: &lt;code&gt;pip install codens-mcp&lt;/code&gt;. Code lives at &lt;a href="https://github.com/codens-ai" rel="noopener noreferrer"&gt;github.com/codens-ai&lt;/a&gt;. If you want the user-facing pitch, that is at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;codens.ai/en&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>claude</category>
      <category>python</category>
      <category>architecture</category>
    </item>
    <item>
      <title>"How one empty message poisoned an entire AI consultation (and the three-layer fix)"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Mon, 11 May 2026 05:33:11 +0000</pubDate>
      <link>https://dev.to/zoetaka38/how-one-empty-message-poisoned-an-entire-ai-consultation-and-the-three-layer-fix-57fb</link>
      <guid>https://dev.to/zoetaka38/how-one-empty-message-poisoned-an-entire-ai-consultation-and-the-three-layer-fix-57fb</guid>
      <description>&lt;p&gt;A user opened a support thread saying their AI consultation had gone unresponsive. Every message they sent came back with an error. Refreshing didn't help. Starting a new tab didn't help. From their side, the conversation was dead.&lt;/p&gt;

&lt;p&gt;The product is Codens Green, a PRD management tool where users hold long, iterative conversations with Claude to refine product requirements. Some of those conversations run dozens of turns. This particular one had thirty-something messages of history, all looking normal in the database. The row was there. The user was authenticated. The organization had credits. And yet every new message hit the API and bounced.&lt;/p&gt;

&lt;p&gt;By the time we shipped the fix it was three layers deep, and only one of those layers is the "actual" fix. The other two were the kind of belt-and-suspenders you only put on once you've been burned. I want to walk through what we saw, what we tried first (which was wrong), what the real cause turned out to be, and the shape of the patch.&lt;/p&gt;

&lt;h2&gt;
  
  
  What 400 BadRequest looked like
&lt;/h2&gt;

&lt;p&gt;The backend log for the failing consultation looked like this on every request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;ERROR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Failed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;generate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;AI&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;response:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;code:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;'type':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'invalid_request_error'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="err"&gt;'message':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'messages.&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;content&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;blocks&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;must&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;non-empty'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same error, same index, every time. The user retried, our code retried, the error didn't move. Index 17 was always index 17 because index 17 was sitting in their stored history.&lt;/p&gt;

&lt;p&gt;I went down the wrong path first. The error code was 400, which felt like an auth-shaped problem, so I started there. Wrong key? The key was fine, every other org was working. Rate limit? No, this org wasn't anywhere close. Model deprecation? We were on a current model, and other consultations using the exact same model were responding normally. I checked the Anthropic status page. Green across the board. I checked our own credit-deduction logic to make sure we weren't somehow short-circuiting requests. Clean.&lt;/p&gt;

&lt;p&gt;About forty minutes in I noticed the &lt;code&gt;messages.17&lt;/code&gt; part of the error and felt stupid. The API was telling me exactly which message in the array it didn't like. I just hadn't read it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real cause
&lt;/h2&gt;

&lt;p&gt;I pulled the consultation row, parsed its &lt;code&gt;messages&lt;/code&gt; JSON, and walked it. Most messages had a few hundred characters of content. Message 17, an assistant message, had &lt;code&gt;content: ""&lt;/code&gt;. Empty string. Not whitespace, not null, just empty.&lt;/p&gt;

&lt;p&gt;Claude's API rejects requests where any message in the &lt;code&gt;messages&lt;/code&gt; array has empty content. That's a hard validation at the boundary, not a soft failure. Which meant: the moment that empty message landed in the consultation's history, every future call was guaranteed to fail, because every future call assembled the full history and sent it back to the API. The conversation had been poisoned by one row.&lt;/p&gt;

&lt;p&gt;The user couldn't recover from inside the app. Our UI didn't expose a "delete message" affordance for this surface, and even if it did, the broken message was an assistant turn, not theirs to edit. From the user's perspective, the consultation just stopped working. Forever. With no error message that meant anything to them.&lt;/p&gt;

&lt;p&gt;This is the worst kind of bug. It only surfaces for users with enough history to have triggered the rare condition that produced the bad row, the dashboards don't flag it (a 400 from Claude looks like an intermittent upstream failure if you don't drill in), and the root cause is invisible because it happened on some earlier request you weren't watching.&lt;/p&gt;

&lt;h2&gt;
  
  
  How an empty assistant message ever got saved
&lt;/h2&gt;

&lt;p&gt;Once I knew what to look for, the chain was straightforward.&lt;/p&gt;

&lt;p&gt;Claude's API occasionally returns a response where the assistant's &lt;code&gt;text_content&lt;/code&gt; is empty. I don't have a great theory for why. Could be transient, could be an edge case in their content filtering, could be a race in how we parse &lt;code&gt;content&lt;/code&gt; blocks when the response has tool-use blocks but no text blocks. It's rare. I'd guess less than one in ten thousand calls in our traffic. But across enough users and enough turns, "rare" becomes "guaranteed."&lt;/p&gt;

&lt;p&gt;Our previous code did approximately this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ai_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_claude_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_consultation_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;consultation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;consultation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ai_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="n"&gt;consultation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_assistant_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ai_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ai_metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ai_response&lt;/code&gt; could be &lt;code&gt;""&lt;/code&gt;. Nothing checked. The empty string flowed into &lt;code&gt;add_assistant_message&lt;/code&gt;, got appended to the message list, and the entity got persisted. From that point forward, the consultation was permanently broken.&lt;/p&gt;

&lt;p&gt;One unchecked write, two days earlier, became a permanent block on the user's account.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three-layer fix
&lt;/h2&gt;

&lt;p&gt;The patch split into three layers. Each one defends a different boundary, and only the middle one is what I'd call the real fix. The other two are there because the real fix doesn't help users who already have a poisoned row, and because I wanted to bound the failure surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: filter on the way out
&lt;/h3&gt;

&lt;p&gt;In the &lt;code&gt;Consultation&lt;/code&gt; domain entity, &lt;code&gt;get_messages_for_ai()&lt;/code&gt; is what assembles the array we send to Claude. The old version included every non-system message. The new version also excludes anything with empty or whitespace-only content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_messages_for_ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;MessageRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SYSTEM&lt;/span&gt;
        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the layer that unsticks every existing poisoned consultation. We didn't run a data migration. We didn't write a one-shot cleanup script. The filter at read time simply skips the bad row on the way to the API, and the conversation works again. The bad row is still sitting in the DB, but it's never sent anywhere that would reject it.&lt;/p&gt;

&lt;p&gt;I want to be honest about what this layer is and isn't. It's defensive. It papers over bad data. It does not prevent the bug from happening again. If you only ship this layer, you keep generating empty rows and keep skipping them, which is fine until something else relies on the history being complete (PRD generation from conversation summary, for instance) and now the user's PRD is missing a turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: detect on the way in
&lt;/h3&gt;

&lt;p&gt;This is the real fix. In our Claude client wrapper, &lt;code&gt;generate_consultation_response()&lt;/code&gt; now refuses to return an empty response at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;text_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No text content in Claude API response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Claude hands us back a response with no text blocks (or only empty text blocks), we raise. The caller in &lt;code&gt;AddMessageUseCase&lt;/code&gt; already has a try/except around the API call and falls back to a generic "sorry, please try again" message. Crucially, that fallback message goes to the user as a transient response. It does not get persisted as an assistant turn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;consultation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_messages_for_ai&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ai_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_claude_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_consultation_response&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="n"&gt;ai_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to generate AI response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ai_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;申し訳ありません。AIからの応答の生成中にエラーが発生しました。...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait, that's not quite right as stated. Look at the existing code and you'll see the fallback message does get persisted via &lt;code&gt;add_assistant_message&lt;/code&gt; further down. That's a separate concern we'll come back to. What matters here is that with Layer 2 in place, the assistant message that gets stored on a failed call is either real text or our explicit, non-empty fallback string. It is never &lt;code&gt;""&lt;/code&gt;. The DB cannot accumulate another poisoned row from this code path.&lt;/p&gt;

&lt;p&gt;If you can only ship one of the three layers, ship this one. Defending at the output boundary, the moment data crosses from "external API response" into "thing we persist," is where bad data deserves to die. Filtering at read time is a workaround. Validating at write time is the fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: bound the history
&lt;/h3&gt;

&lt;p&gt;This one is technically a separate bug, but I shipped it in the same PR series because the user-visible symptom overlaps. Long consultations were starting to push against the context window, and a few users were seeing failures that looked similar (intermittent API errors on long-running conversations) but had a different cause.&lt;/p&gt;

&lt;p&gt;So in &lt;code&gt;AddMessageUseCase&lt;/code&gt;, we cap the history we send:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MAX_HISTORY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MAX_HISTORY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;MAX_HISTORY&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Forty messages is roughly twenty user/assistant turns. The trailing slice gets the most recent context, which is almost always what matters. The &lt;code&gt;while&lt;/code&gt; loop handles a Claude API requirement that conversations must start with a user role. If the slice happens to begin with an assistant message (because we truncated mid-turn), we drop the leading assistants until we find a user message.&lt;/p&gt;

&lt;p&gt;Three things to flag about Layer 3. First, twenty turns is a product choice, not a technical limit; we picked it because our consultation UI doesn't show more than that comfortably anyway, and longer histories were producing diminishing returns on AI quality. Second, the first-user-role correction is a Claude-specific constraint. Don't carry this verbatim to a different provider without checking their docs. Third, this layer is unrelated to the empty-message bug. It's bundled in because the failure mode looks adjacent from a triage perspective, and shipping them together meant one round of regression testing instead of two.&lt;/p&gt;

&lt;h2&gt;
  
  
  The migration we didn't write
&lt;/h2&gt;

&lt;p&gt;One thing I want to underline. Layer 1, the read-time filter, accidentally did the work of a data migration without being a data migration. Every existing poisoned consultation in our DB started working again the moment the deploy went out. No SQL to write, no rows to update, no offline job to run. The defensive layer absorbed the historical damage.&lt;/p&gt;

&lt;p&gt;That's not always the right tradeoff. If we'd needed downstream consumers (analytics, PRD generation, exports) to see a complete history, leaving bad rows in place would have leaked into those features later. In our case the only consumer that read the bad message was the call to Claude itself, so filtering at read time was sufficient. But it's worth naming the pattern explicitly: a defensive read-side filter can serve as a zero-downtime migration for a class of bad data, as long as you're confident you've enumerated every reader.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd take away
&lt;/h2&gt;

&lt;p&gt;The thing I keep coming back to is that the cause of the user's problem (one empty cell, written two days earlier, somewhere on the request path) had nothing visible in common with the symptom they were experiencing (every new message fails with a 400 today). The signal that mattered was buried in the error message itself, and I spent forty minutes chasing API keys before I read it. Read the error.&lt;/p&gt;

&lt;p&gt;The three-layer shape, defend on the way in, defend on the way out, bound the size, is general. It works for any case where you're persisting outputs from an external API and replaying them as inputs. Validate before you persist. Filter before you replay. Cap the surface.&lt;/p&gt;

&lt;p&gt;If you're building anything with Claude, &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;Codens&lt;/a&gt; is what we use this same stack to build.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>python</category>
      <category>debugging</category>
    </item>
  </channel>
</rss>
