<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Takayuki Kawazoe</title>
    <description>The latest articles on DEV Community by Takayuki Kawazoe (@zoetaka38).</description>
    <link>https://dev.to/zoetaka38</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3902826%2F0187a85d-f9a1-45bb-871d-bf5e49ddcccc.jpeg</url>
      <title>DEV Community: Takayuki Kawazoe</title>
      <link>https://dev.to/zoetaka38</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zoetaka38"/>
    <language>en</language>
    <item>
      <title>"One JWT, five services, and the python-jose audience list trap"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Sat, 16 May 2026 04:34:53 +0000</pubDate>
      <link>https://dev.to/zoetaka38/one-jwt-five-services-and-the-python-jose-audience-list-trap-5e3i</link>
      <guid>https://dev.to/zoetaka38/one-jwt-five-services-and-the-python-jose-audience-list-trap-5e3i</guid>
      <description>&lt;p&gt;&lt;code&gt;audience must be a string or None&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That was the exception python-jose threw the moment our unified MCP server tried to talk to the second backend behind it. The token was valid. The signature checked out. The claims were correct. The library just refused to accept a list as the expected audience, and the JWT spec disagrees with the library on whether that should be a problem.&lt;/p&gt;

&lt;p&gt;We run a single MCP server, &lt;code&gt;codens-mcp&lt;/code&gt; on PyPI, that fronts five backends: Red (auto-fix), Blue (QA), Green (PRD), Purple (orchestration), and Auth. One MCP token, five destinations. When Claude calls a Red tool, the MCP server proxies an HTTP request to the Red backend carrying that same token. Same for Blue, Green, Purple, Auth. Each backend has its own primary audience for its own user-facing tokens, and we wanted all of them to also accept the MCP server's token without minting five service-specific JWTs per session.&lt;/p&gt;

&lt;p&gt;This is the story of how that ran into a python-jose quirk, and the 12-line workaround we ended up shipping.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture, briefly
&lt;/h2&gt;

&lt;p&gt;Codens exposes 31 tools across the five product surfaces through one MCP server. From Claude's side it is a single connection. From the backends' side, each one sees a normal authenticated HTTP request with a bearer token in the header. The token is issued by the Auth service. Its &lt;code&gt;aud&lt;/code&gt; claim is &lt;code&gt;purple-codens-mcp&lt;/code&gt;, because the MCP server is the thing the user logged into when they connected their client.&lt;/p&gt;

&lt;p&gt;Each backend already had its own audience for its first-party tokens. Green expects &lt;code&gt;green-codens&lt;/code&gt;. Red expects &lt;code&gt;red-codens&lt;/code&gt;. And so on. Those audiences were baked into the OAuth verifier and matched the audience claim on tokens minted by that service's own login flow.&lt;/p&gt;

&lt;p&gt;We had two ways forward.&lt;/p&gt;

&lt;p&gt;The first option: mint five tokens per MCP session. The MCP server logs into Red, Green, Blue, Purple, and Auth as the user, gets five JWTs, and selects the right one based on which tool the user invoked. This is conceptually clean. It also means five times the token issuance, five rotation surfaces, five sets of refresh flows to coordinate, and a routing layer in the MCP server that has to know which token belongs to which tool. None of that adds value.&lt;/p&gt;

&lt;p&gt;The second option: mint one token, declare its audience as &lt;code&gt;purple-codens-mcp&lt;/code&gt;, and teach every backend to accept that audience in addition to its own primary one. The MCP server holds one credential. Each backend keeps its primary audience for its own native flows and additionally trusts MCP-issued tokens. Rotation surface stays small. The routing logic in the MCP server disappears.&lt;/p&gt;

&lt;p&gt;We picked option two. The plan was to add a per-service config that lists additional accepted audiences, expand the verifier to check against the union, and ship it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix v1: pass a list to python-jose
&lt;/h2&gt;

&lt;p&gt;The setting looked like this in every backend service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseSettings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_AUDIENCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;green-codens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_ADDITIONAL_AUDIENCES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purple-codens-mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The verifier change looked equally innocuous. python-jose's &lt;code&gt;jwt.decode&lt;/code&gt; accepts an &lt;code&gt;audience&lt;/code&gt; keyword. The naive reading of every JWT tutorial on the internet says you give it the expected audience and it checks the token's &lt;code&gt;aud&lt;/code&gt; against that. So we built a list of accepted audiences and handed it over:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;audiences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verify_audience&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audience&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;audiences&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OAUTH_ADDITIONAL_AUDIENCES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;audiences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OAUTH_ADDITIONAL_AUDIENCES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;algorithms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;algorithm&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audiences&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;audiences&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the version we wrote, ran a quick local smoke test against, and pushed to the dev environment thinking the work was done. The shape of the change matched the shape of the problem. A list of allowed audiences in, an &lt;code&gt;aud&lt;/code&gt; claim checked against that list, request accepted. Done.&lt;/p&gt;

&lt;p&gt;The dev environment, of course, immediately disagreed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap
&lt;/h2&gt;

&lt;p&gt;The MCP server made its first call into Green and the request came back as a 401. The Green logs had the actual exception underneath the generic auth failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TypeError: audience must be a string or None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;python-jose's &lt;code&gt;jwt.decode&lt;/code&gt; does not accept a list for its &lt;code&gt;audience&lt;/code&gt; parameter. If you pass one, it raises before it even looks at the token. The library has only ever supported single-string audience verification. There is no flag, no overload, no helper that takes a list.&lt;/p&gt;

&lt;p&gt;RFC 7519 is unambiguous on the other side of this question. Section 4.1.3 defines &lt;code&gt;aud&lt;/code&gt; as either a single case-sensitive string or an array of case-sensitive strings, and verification logic is supposed to check that the recipient identifies itself with at least one of the values present. The spec assumes set membership semantics on both ends. The token can have multiple audiences, and the verifier can accept multiple audiences. Whether either side is a list is a transport detail.&lt;/p&gt;

&lt;p&gt;python-jose is one of the most-used Python JWT libraries. Most FastAPI tutorials reach for it without thinking. It is also old, and the maintainer activity is thin. There is a multi-year-old GitHub issue tracking exactly this limitation, with patches floating around in forks and pull requests that never merged. The library's behavior is what it is, and if you need list audience verification, you are on your own.&lt;/p&gt;

&lt;p&gt;The honest read here is that the JWT spec describes capability and most libraries describe a comfortable subset of it. The subset is usually fine. The moment you do anything cross-service it stops being fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix v2: decode without audience verification, then verify manually
&lt;/h2&gt;

&lt;p&gt;The fix that worked is to use python-jose for what it is good at, which is signature verification and claim decoding, and do the audience check ourselves. python-jose lets you disable individual claim checks through its &lt;code&gt;options&lt;/code&gt; dict. &lt;code&gt;verify_aud: False&lt;/code&gt; turns off the built-in audience verification entirely. The signature, expiry, issuer, and everything else still get checked. We just take responsibility for &lt;code&gt;aud&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;should_verify_aud&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;verify_audience&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;algorithms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;algorithm&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verify_aud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;should_verify_aud&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;allowed_audiences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OAUTH_ADDITIONAL_AUDIENCES&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;token_aud&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;token_aud_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_aud&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_aud&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token_aud&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token_aud&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_aud_set&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;allowed_audiences&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;InvalidTokenError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid audience: token aud=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token_aud&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt;, expected one of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allowed_audiences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The set intersection does the entire job. &lt;code&gt;token_aud_set &amp;amp; allowed_audiences&lt;/code&gt; returns a set of values present in both, and if that set is empty the token is for someone else and we reject it. If the token's &lt;code&gt;aud&lt;/code&gt; is a single string we wrap it in a one-element set. If it is a list we convert directly. If it is missing we get an empty set and the intersection is empty, which fails closed.&lt;/p&gt;

&lt;p&gt;One subtle thing about the order. We compute &lt;code&gt;should_verify_aud&lt;/code&gt; before calling &lt;code&gt;jwt.decode&lt;/code&gt;, not after, because we want the variable to capture the caller's intent independent of what python-jose returns. If someone passes &lt;code&gt;verify_audience=False&lt;/code&gt;, we skip the manual check entirely. If they pass &lt;code&gt;verify_audience=True&lt;/code&gt; but the service has no configured audience, there is nothing to verify against, so we also skip. The manual block only runs when there is something real to check.&lt;/p&gt;

&lt;p&gt;The error message includes both the token's actual &lt;code&gt;aud&lt;/code&gt; value and the sorted list of audiences we accept. When you debug an inter-service auth failure at 2am, the only thing worse than a 401 with no detail is a 401 that tells you nothing about the mismatch. The cost of formatting that message into the exception is zero and the time it saves is real.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bonus pattern: decode and verify as separate steps
&lt;/h2&gt;

&lt;p&gt;Once you have done this once, decoupling decoding from verification starts to feel like the right default for any JWT code that has to do anything non-trivial. The library is good at parsing the structure and confirming the signature. Your service is the one that knows which claims matter and what acceptance looks like.&lt;/p&gt;

&lt;p&gt;The same pattern handles a bunch of adjacent problems. Token introspection for audit logs without re-running all the checks. Soft expiry where you log a warning at 90 percent of the lifetime instead of rejecting. Migration windows where you accept tokens signed with either the old or new key for a week. Custom claim validation that the library has never heard of. Whenever a future library bug lands in the issuer check or the expiry math, you have an escape hatch already in place because the verification logic is yours.&lt;/p&gt;

&lt;p&gt;This is also the answer even if python-jose ships list audience support tomorrow. You do not lose anything by owning the audience check. You gain a place to put the next requirement that does not fit cleanly into a kwarg.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap
&lt;/h2&gt;

&lt;p&gt;Multi-service authentication keeps running into the gap between what JWT can do and what the convenient libraries actually do. The spec is generous. The libraries are opinionated. When you stitch services together, the opinions usually have to give.&lt;/p&gt;

&lt;p&gt;The unified-token path was worth the workaround. One JWT, one rotation, one issuer, five backends that each know how to accept it. The cost was a dozen lines of manual verification in a shared OAuth module. We would make the same trade again.&lt;/p&gt;

&lt;p&gt;If you want to see how Codens uses this on the agent side, the English landing page is at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt;. The MCP server is &lt;code&gt;codens-mcp&lt;/code&gt; on PyPI and it is what the agent connects to when it needs to talk to any of the five product surfaces.&lt;/p&gt;

</description>
      <category>jwt</category>
      <category>python</category>
      <category>fastapi</category>
      <category>auth</category>
    </item>
    <item>
      <title>"Claude 3, Qwen 6: why we set a different fix_verify retry cap per model"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Fri, 15 May 2026 07:58:45 +0000</pubDate>
      <link>https://dev.to/zoetaka38/claude-3-qwen-6-why-we-set-a-different-fixverify-retry-cap-per-model-oce</link>
      <guid>https://dev.to/zoetaka38/claude-3-qwen-6-why-we-set-a-different-fixverify-retry-cap-per-model-oce</guid>
      <description>&lt;p&gt;Claude gets 3 retries. Qwen gets 6. Everything else gets 5.&lt;/p&gt;

&lt;p&gt;That is the default &lt;code&gt;fix_verify_retry_cap&lt;/code&gt; in Codens Purple right now, after a few weeks of staring at fix-rate curves per model. It started as one global cap, the same number for every model the workflow could route to. We changed it once we had enough production data to see that the same number was both too high for one model and too low for another at the same time.&lt;/p&gt;

&lt;p&gt;This is the story of the split, what the loop actually does, and the few lines of code that put the policy in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix_verify loop
&lt;/h2&gt;

&lt;p&gt;Codens Purple runs an agent that proposes a code fix, then verifies it by running a test or a check, then decides whether to retry with feedback from the verification step. The loop looks roughly like this. Generate a candidate change, apply it, run the verify command, read the result. If verify passes, the loop is done. If verify fails, feed the failure output back into the next prompt and try again. Each retry is a new API call. Each API call costs per-token credits, and verify itself costs wall clock time plus whatever the test suite costs to run.&lt;/p&gt;

&lt;p&gt;The retry cap is the integer that says how many of those iterations the loop is allowed before it gives up and surfaces the partial result to the user. A cap of 1 means one attempt, no retry. A cap of 3 means an initial attempt plus two retries. A cap of 6 means up to six attempts total.&lt;/p&gt;

&lt;p&gt;The cap matters because the curve of "fix succeeds at attempt N" is not flat. It is heavily front-loaded. Most successful fixes succeed on attempt 1 or 2. The question for any given model is how long the long tail is, and how much of that tail is worth paying for.&lt;/p&gt;

&lt;p&gt;When we had one cap for all models, that one number had to be a compromise. The compromise was bad in two directions at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we got to multi-model
&lt;/h2&gt;

&lt;p&gt;Codens started with Claude as the only model. Specifically, Claude via the Anthropic API, using a raw API key with per-token billing. Not the subscription, not the bundled tier. We are a multi-tenant product running thousands of small &lt;code&gt;fix_verify&lt;/code&gt; cycles per day across many customers, and a subscription does not cleanly support that shape of workload. Per-token billing lets us scale spend with usage and attribute cost back to the project that incurred it.&lt;/p&gt;

&lt;p&gt;This came up again recently when Anthropic announced that the &lt;code&gt;claude -p&lt;/code&gt; print mode, the Agent SDK, and CI use cases now require an API plan rather than a subscription. For us this was a non-event. We were already on the API. The announcement just confirmed that the path we picked is the path Anthropic wants production agent workloads to take.&lt;/p&gt;

&lt;p&gt;Claude is excellent for &lt;code&gt;fix_verify&lt;/code&gt;. The per-attempt success rate is high and the failure modes are usually informative, meaning when it does not fix the bug on attempt 1, the diff it produces and the verify output together give the next attempt a real signal. The downside is cost. At scale, with thousands of fix loops a day, the per-token bill is a real line item.&lt;/p&gt;

&lt;p&gt;A few months in, we started evaluating Qwen as a secondary model to drive cost down on a subset of tasks. Qwen runs on our own infrastructure on AWS EC2 hosts, which gives us per-token cost well below the Anthropic API for the same task size. The tradeoff was the reliability profile. Per-attempt success rate is lower than Claude. Failure modes are noisier. Some of the time the model will produce a syntactically valid but semantically wrong patch, and the verify step is the only thing that catches it.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of model where retries earn their keep. Qwen's curve of cumulative success vs attempt number rises more slowly than Claude's, but it keeps rising further out. Attempt 5 is still adding meaningful success rate. With Claude, attempt 5 is mostly wasted credits on a fundamentally wrong understanding that more retries are not going to fix.&lt;/p&gt;

&lt;p&gt;So we had two models in production with different shapes of success curve, and we were applying the same retry cap to both. Something had to give.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why one cap did not work
&lt;/h2&gt;

&lt;p&gt;Suppose we set the global cap to 3, tuned for Claude. Claude is fine. Qwen leaves real success on the table, because attempts 4, 5, and 6 would have converted a measurable fraction of failures into passes, and now they do not happen. Fix rate drops on Qwen-routed tasks. Users notice. They route more work to Claude, which is the opposite of what we wanted from introducing Qwen.&lt;/p&gt;

&lt;p&gt;Suppose we set the global cap to 6, tuned for Qwen. Qwen is fine. Claude wastes credits. Attempts 4, 5, and 6 on a Claude-routed task that has already failed three times have a low chance of succeeding, because Claude's failure mode at attempt 3 is usually "I do not understand the bug" or "the test I am running is checking something I cannot see," and the same prompt with the same verify output is not going to flip that on attempt 6. We were paying full Sonnet-tier per-token cost for those attempts.&lt;/p&gt;

&lt;p&gt;The compromise we ran for a while was a cap of 5 globally. It was bad on both axes. Claude wasted 2 attempts worth of credits on its failure cases. Qwen left 1 attempt worth of success on the floor. We could see this in the data once we started bucketing the loop outcome by model and attempt number. The right answer was clearly per-model, not global.&lt;/p&gt;

&lt;h2&gt;
  
  
  The per-model defaults
&lt;/h2&gt;

&lt;p&gt;The implementation is small. We added a nullable integer column on the project table, &lt;code&gt;fix_verify_retry_cap&lt;/code&gt;, with NULL meaning "use the model-based default." A helper function returns the default for a given model name. The use case layer combines the two when it kicks off a loop.&lt;/p&gt;

&lt;p&gt;The helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_default_fix_verify_cap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The schema field, on the project update payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PurpleProjectUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;fix_verify_retry_cap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;le&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Alembic migration adds the column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purple_projects&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fix_verify_retry_cap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the use case resolves the effective cap when it starts a task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;effective_cap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fix_verify_retry_cap&lt;/span&gt;
    &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_default_fix_verify_cap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;execute_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The override range is 1 to 20. One on the low end because some projects have run a single attempt followed by a human review, and we do not want to break that pattern. Twenty on the high end because it is a reasonable ceiling for a customer who wants to push the long tail of a cheap self-hosted model further than our default. If they set 20 and burn through it, that is their cost. We log the effective cap on every task so it shows up in the project audit log alongside the outcome.&lt;/p&gt;

&lt;p&gt;The defaults of 3, 5, 6 are not magic numbers pulled out of intuition. We picked them by plotting cumulative fix rate against attempt number for each model from a few weeks of production runs and looking at where the curve flattens. For Claude, the curve is essentially flat past attempt 3. For Qwen, it is still meaningfully rising at 5 and starts to flatten at 6. For other models we had less data, so 5 is the safe middle.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tradeoff
&lt;/h2&gt;

&lt;p&gt;The honest cost of this change is that adding a new model to the routing layer is no longer free. Before, we added a model and it inherited the global cap. Now we have to pick a default. If we do not pick one, the model falls through to the 5 default, which is usually fine but not always optimal.&lt;/p&gt;

&lt;p&gt;In practice, this turned into a small ritual when introducing a new model. Route a small fraction of traffic to it at cap 8 or 10 for a week, plot the curve, find the elbow, set the default to one or two above the elbow. The ritual takes a few hours of analysis on top of the model integration itself. We considered automating it, computing the default from rolling fix rates per model on a cadence. We have not built that yet. The set of models we route to is small enough that a manual review every couple of months is fine. If the set grew to ten or more, automation would start to pay back.&lt;/p&gt;

&lt;p&gt;The other tradeoff is that the policy is now opinionated in a way users can feel. If a customer on a Claude-routed project reports "fix gave up too early," the answer is sometimes "the default cap is 3, raise it to 5 on your project and try again." That is a real conversation we have had. It is the price of a default that is right on average but not for every codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the cap is, really
&lt;/h2&gt;

&lt;p&gt;A retry cap is a budget. Specifically, it is a budget that integrates two things at once. The marginal probability of success at each attempt. The marginal cost of each attempt. The optimal cap is the largest N where the expected value of attempt N is still positive, which means attempt N's marginal success times the value of a fix exceeds attempt N's marginal cost in credits and verify time. That number is per-model because both factors are per-model.&lt;/p&gt;

&lt;p&gt;When we set 3 for Claude and 6 for Qwen, we are saying the integral converges faster on Claude because high per-attempt success runs out of incremental room quickly, and converges slower on Qwen because lower per-attempt success keeps adding incremental room for longer at a much lower per-attempt cost. The split is what makes a multi-model workflow economically coherent.&lt;/p&gt;

&lt;p&gt;If you are running anything like this loop in production, do not pick one number for all your models. Plot the curve. The number falls out.&lt;/p&gt;

&lt;p&gt;Codens Purple is part of the harness at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt; . The retry cap split lives in &lt;code&gt;purple-codens&lt;/code&gt; under the project use case layer.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>python</category>
      <category>anthropic</category>
    </item>
    <item>
      <title>"When 'Control request timeout: initialize' actually means SIGKILL: Claude Code CLI OOM inside Celery"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Thu, 14 May 2026 00:08:43 +0000</pubDate>
      <link>https://dev.to/zoetaka38/when-control-request-timeout-initialize-actually-means-sigkill-claude-code-cli-oom-inside-n0o</link>
      <guid>https://dev.to/zoetaka38/when-control-request-timeout-initialize-actually-means-sigkill-claude-code-cli-oom-inside-n0o</guid>
      <description>&lt;p&gt;A production Celery task in Codens Green started returning this, intermittently, only under real load:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Control request timeout: initialize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The string is suspiciously specific. It looks like the kind of message you would see if Claude Code CLI's MCP initialization handshake had timed out on the other side of a pipe. That is what it sounds like. That is not what it was.&lt;/p&gt;

&lt;p&gt;The task is &lt;code&gt;analyze_code_specification&lt;/code&gt;. It spawns Claude Code CLI as a subprocess to analyze a repository against a PRD. It worked in staging, worked locally, worked in CI. It failed in production a few times a day, almost always when more than one analysis was running at the same time.&lt;/p&gt;

&lt;p&gt;What we eventually shipped: route that task to a dedicated Celery queue, run that queue on a separate ECS Fargate worker tier with 8 GB of memory, pin concurrency to 1. The real bug was the Linux kernel OOM killer terminating Claude Code CLI partway through startup, before it could complete its handshake with the parent task. The misleading log line was just what survives when a child process is shot in the head mid-init.&lt;/p&gt;

&lt;p&gt;This is the chase.&lt;/p&gt;

&lt;h2&gt;
  
  
  The wrong paths
&lt;/h2&gt;

&lt;p&gt;I spent the better part of a day inside Claude Code CLI's initialization code path, because that is where the error string lived.&lt;/p&gt;

&lt;p&gt;First theory: stdio buffering. The CLI talks to the parent over stdin/stdout. If the parent is not reading fast enough, the child can block on a full pipe and look like it is hanging. I added explicit buffer drains, raised the timeout, switched to line-buffered mode on both sides. The error still happened.&lt;/p&gt;

&lt;p&gt;Second theory: MCP protocol version mismatch. Maybe a recent Claude Code update changed the init handshake and our version pin was stale. I diffed the changelog, compared protocol versions across our deployed image and a known-good local environment. They matched.&lt;/p&gt;

&lt;p&gt;Third theory: a bug in the agent SDK config. We pass a lot of options into the CLI. Maybe one of them was triggering a slow path during init that exceeded the handshake budget. I trimmed the config down to the smallest reproducible set, then to nothing. Same error in production. Still nothing in staging.&lt;/p&gt;

&lt;p&gt;Fourth theory, the one I am least proud of: maybe Claude Code itself has an upstream init bug under concurrent load. I drafted half of a GitHub issue before I noticed I had no actual evidence and was just frustrated.&lt;/p&gt;

&lt;p&gt;None of these held up. The fingerprint of the failure, intermittent, only under load, only in production, did not match any of them. Buffering bugs are deterministic. Protocol mismatches are deterministic. Config bugs are deterministic. This was load-correlated. That is a different shape of problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The exit code
&lt;/h2&gt;

&lt;p&gt;The thing that finally cracked it was looking at the subprocess exit code instead of the log message. We were capturing the error string before we captured &lt;code&gt;returncode&lt;/code&gt;, and the error string was so plausible it had crowded out the rest of the diagnostic surface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;proc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_subprocess_exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stderr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;communicate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude code failed rc=%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The value coming out was &lt;code&gt;-9&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On POSIX, when &lt;code&gt;subprocess&lt;/code&gt; reports a negative return code, the absolute value is the signal that killed the child. Signal 9 is SIGKILL. SIGKILL cannot be caught, cannot be handled, cannot be cleaned up after. The process is removed from the run queue. There is exactly one common source of SIGKILL on Linux that arrives without a parent or operator sending it on purpose: the kernel OOM killer.&lt;/p&gt;

&lt;p&gt;That was the moment. This is no longer a Claude Code problem. This is an OS-level problem. The CLI had not timed out during initialization. The CLI had been shot during initialization, by the kernel, for using too much memory.&lt;/p&gt;

&lt;p&gt;The "Control request timeout: initialize" message was a downstream symptom. The parent task was waiting for the child to finish its handshake. The child was killed mid-handshake. The parent eventually gave up waiting and surfaced the most specific thing it knew, which was that init had not completed in time. The error was technically true and completely misleading.&lt;/p&gt;

&lt;h2&gt;
  
  
  OOM math
&lt;/h2&gt;

&lt;p&gt;Once you know the shape, the math is easy.&lt;/p&gt;

&lt;p&gt;Claude Code CLI is not a small process. It boots a JavaScript runtime, loads the agent SDK, hydrates context, and prepares for tool calls. In our workload, resident memory per invocation sits between roughly 500 MB and 1.5 GB, peaking higher during initial context load.&lt;/p&gt;

&lt;p&gt;Our Celery worker pool was the general-purpose one. Sized for the rest of our tasks, which are normal Python work: webhook fan-out, database writes, small HTTP calls. Those tasks live happily in well under 200 MB each. The worker host had memory headroom appropriate to that profile, with default Celery concurrency, which spins up multiple worker processes per host so several tasks run in parallel.&lt;/p&gt;

&lt;p&gt;That is fine for normal traffic. It is not fine when two of those parallel tasks each decide to spawn a 1+ GB CLI subprocess.&lt;/p&gt;

&lt;p&gt;Picture the failure mode. Two PRDs are submitted within the same minute. Two Celery workers pick up &lt;code&gt;analyze_code_specification&lt;/code&gt;. Each launches Claude Code CLI. Both CLIs start allocating. The host's resident memory climbs past its limit. The kernel's OOM killer wakes up and picks a victim, typically the largest recent allocator. Claude Code CLI dies with SIGKILL. The Celery task surfaces "Control request timeout: initialize" because that is what it saw from its end of the pipe. The other task may or may not also die, depending on timing.&lt;/p&gt;

&lt;p&gt;The reason this never showed up in staging was simple: staging has one user, me, running one job at a time. Concurrency was always 1 by accident. The bug needed two simultaneous invocations on the same host to express itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix, in four parts
&lt;/h2&gt;

&lt;p&gt;I did not want to over-engineer this. The fix is structurally small. It is mostly Celery routing and infra sizing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Dedicated queue.&lt;/strong&gt; &lt;code&gt;analyze_code_specification&lt;/code&gt; got its own queue, separated from everything else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# celery_app.py
&lt;/span&gt;&lt;span class="n"&gt;task_routes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks.analyze_code_specification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks.run_fix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fixing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks.control_plane.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;control_plane&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks.plan_monitor.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan_monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# everything else falls through to "default"
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point of the queue split is not load balancing. It is so we can attach a different worker profile to this task without changing anything about the others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dedicated ECS Fargate worker tier.&lt;/strong&gt; The &lt;code&gt;analysis&lt;/code&gt; queue gets its own worker service, on its own Fargate task definition, with 8 GB of memory. The rest of the workers stay on the smaller general-purpose host. One service, one queue, one process shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Concurrency = 1.&lt;/strong&gt; The worker for the &lt;code&gt;analysis&lt;/code&gt; queue starts like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;celery &lt;span class="nt"&gt;-A&lt;/span&gt; app worker &lt;span class="nt"&gt;-Q&lt;/span&gt; analysis &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 1 &lt;span class="nt"&gt;--loglevel&lt;/span&gt; info
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the load-bearing piece. Even on an 8 GB host, if you let two CLI invocations run in parallel, you can still blow past the limit when both peak at 1.5 GB at the same time and the OS plus worker plus everything else has its own footprint. Concurrency 1 means exactly one Claude Code CLI subprocess exists on this host at any time. Two analyses come in, the second one queues, waits, runs next. Slower, totally fine, never OOMs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Memory headroom.&lt;/strong&gt; 1 CLI × roughly 1.5 GB peak × concurrency 1, against 8 GB total, with the worker process and OS taking a few hundred MB. That gives more than 5 GB of headroom for a worst-case CLI invocation. If we ever needed to raise concurrency to 2, we would also need to either double the instance size or accept the OOM risk back. We chose not to.&lt;/p&gt;

&lt;p&gt;We also added regression tests at the routing layer, asserting that &lt;code&gt;analyze_code_specification&lt;/code&gt; resolves to the &lt;code&gt;analysis&lt;/code&gt; queue, that control-plane tasks do not accidentally get rerouted there, and that plan-monitor isolation is preserved. The routing dict is the kind of thing that quietly bit-rots in a PR review, and a misroute would silently bring the bug back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;The dedicated worker tier is more expensive per task than just bumping the general worker's RAM. It scales slower under burst load because the queue depth gates throughput. It is one more service to deploy, monitor, alert on, and update during a Claude Code CLI version bump. None of that is free.&lt;/p&gt;

&lt;p&gt;What we got in return is that this failure mode cannot happen anymore for any reason that is not "we accidentally raised concurrency above 1." That is a single config line in one repo with a test guarding it. I will take that tradeoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  What generalizes
&lt;/h2&gt;

&lt;p&gt;Two things stuck with me after this.&lt;/p&gt;

&lt;p&gt;One: when a child process surfaces a plausible-sounding error during a handshake, check &lt;code&gt;returncode&lt;/code&gt; before you check the message. A negative return code on POSIX is a different category of failure from anything the application itself can report. A negative number is the OS telling you the application never got a chance.&lt;/p&gt;

&lt;p&gt;Two: per-task memory profiles matter for Celery worker sizing in a way that defaults do not protect you from. A worker pool tuned for 200 MB tasks will silently kill a 1.5 GB task and tell you something else happened. If your task spawns a subprocess that is heavier than your worker, the right answer is almost always a separate queue with its own concurrency and its own host, not a bigger general-purpose host.&lt;/p&gt;

&lt;p&gt;We build Codens, an AI dev harness with this kind of analysis baked in. &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>celery</category>
      <category>python</category>
      <category>debugging</category>
    </item>
    <item>
      <title>"Cutting MCP token bloat by 12x: what happened when we packed 31 tools into one server"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Tue, 12 May 2026 02:49:09 +0000</pubDate>
      <link>https://dev.to/zoetaka38/cutting-mcp-token-bloat-by-12x-what-happened-when-we-packed-31-tools-into-one-server-4149</link>
      <guid>https://dev.to/zoetaka38/cutting-mcp-token-bloat-by-12x-what-happened-when-we-packed-31-tools-into-one-server-4149</guid>
      <description>&lt;p&gt;Earlier this week &lt;a href="https://twitter.com/akshay_pachaar" rel="noopener noreferrer"&gt;@akshay_pachaar&lt;/a&gt; summarized a year of MCP-vs-CLI arguing into one sharp line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The MCP vs CLI debate. For most of 2025, AI Engineers argued about it. The skeptics had real numbers: Playwright MCP eats 13.7K tokens, Chrome DevTools MCP eats 18K. A 5-server setup burns 55K tokens before any work."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He is right. Those numbers are the steady drumbeat against MCP as a delivery format. If your agent burns 55K tokens just advertising capabilities, the protocol starts to look like a tax.&lt;/p&gt;

&lt;p&gt;We just shipped a counter-data point. &lt;code&gt;codens-mcp&lt;/code&gt; is a single Python package that exposes 31 tools across five products (Purple, Red, Blue, Green, Auth, plus a cross-product registration tool). I sat down with &lt;code&gt;wc -c&lt;/code&gt; and a calculator and got a number I had to triple-check: the entire tool surface, descriptions and all, is ~4,720 tokens. That is roughly 12x less than the 5-server number in the tweet, and about 3x less than Playwright MCP alone.&lt;/p&gt;

&lt;p&gt;This is not a "look how clever we are" post. It is the boring engineering answer: most of MCP's token cost is not the protocol, it is the loading strategy. Below I walk through how we measured it, the five architecture decisions that made the number small, and the real tradeoffs we ate to get there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The measurement
&lt;/h2&gt;

&lt;p&gt;Here is the actual byte count from the tool definition files, straight off disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auth_tools.py     1,555 chars
blue_tools.py     2,576 chars
cross_tools.py    3,913 chars
green_tools.py    6,160 chars
purple_tools.py   1,448 chars   # re-exports 16 tools from purple-codens-mcp
red_tools.py      3,231 chars
                 ───────
total            18,883 chars  ≈ 4,720 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 4 chars/token heuristic is a known underestimate for natural-language English (3.5 is closer to GPT/Claude tokenizers in practice), but it is fine as an upper-bound on a registration payload that contains a mix of Python identifiers, docstrings, and JSON-schema-ish hints. The MCP server sends a slightly inflated version of these definitions over the wire as tool descriptors, so the on-context cost the model sees is in the same order of magnitude. I have done the apples-to-apples comparison with &lt;code&gt;tiktoken&lt;/code&gt; on the rendered descriptors and the number lands between 4.4K and 5.1K depending on whether you count the JSON schema framing. ~4,720 is the honest middle.&lt;/p&gt;

&lt;p&gt;The 31 tools break down like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purple (16, re-exported from &lt;code&gt;purple-codens-mcp&lt;/code&gt;): &lt;code&gt;purple_login&lt;/code&gt;, &lt;code&gt;purple_whoami&lt;/code&gt;, &lt;code&gt;purple_analyze_repo&lt;/code&gt;, &lt;code&gt;purple_register_project&lt;/code&gt;, and twelve more covering projects, repos, instructions, workflows, and SSE.&lt;/li&gt;
&lt;li&gt;Red (4): &lt;code&gt;red_create_bug_report&lt;/code&gt;, &lt;code&gt;red_get_bug_report&lt;/code&gt;, &lt;code&gt;red_analyze_bug_report&lt;/code&gt;, &lt;code&gt;red_submit_bug_fix_plan_to_purple&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Blue (4): &lt;code&gt;blue_list_e2e_tests&lt;/code&gt;, &lt;code&gt;blue_generate_e2e_test&lt;/code&gt;, &lt;code&gt;blue_run_e2e_test&lt;/code&gt;, &lt;code&gt;blue_get_e2e_test_results&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Green (4): &lt;code&gt;green_create_consultation_with_message&lt;/code&gt;, &lt;code&gt;green_send_consultation_message&lt;/code&gt;, &lt;code&gt;green_convert_consultation_to_prd&lt;/code&gt;, &lt;code&gt;green_create_kickoff&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Auth (2): &lt;code&gt;auth_agent_signup&lt;/code&gt;, &lt;code&gt;auth_get_pricing&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Cross (1): &lt;code&gt;codens_register_project_unified&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where this lands against the public reference points:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;Approx. tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Playwright MCP&lt;/td&gt;
&lt;td&gt;many&lt;/td&gt;
&lt;td&gt;13,700&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chrome DevTools MCP&lt;/td&gt;
&lt;td&gt;many&lt;/td&gt;
&lt;td&gt;18,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5-server stack (mixed)&lt;/td&gt;
&lt;td&gt;varies&lt;/td&gt;
&lt;td&gt;~55,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;codens-mcp&lt;/code&gt; (unified)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~4,720&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If we had shipped five separate MCPs, one per product, even at a conservative per-server registration overhead the stack would have cost ~65K tokens of context before any tool ran. We did not, and that is the whole story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why one package works
&lt;/h2&gt;

&lt;p&gt;Five decisions did the work. None of them are clever. All of them are boring tradeoffs that happen to compound.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Prefix namespacing instead of MCP-server-level scoping
&lt;/h3&gt;

&lt;p&gt;Every tool carries its product prefix in the name. The flat namespace makes the file you saw above legal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;purple_login, purple_whoami, purple_analyze_repo, ...
red_create_bug_report, red_analyze_bug_report, ...
blue_generate_e2e_test, blue_run_e2e_test, ...
green_convert_consultation_to_prd, ...
auth_agent_signup, auth_get_pricing
codens_register_project_unified
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We pay verbosity in the tool name. We get zero collision risk and one MCP process. I considered nested groupings (&lt;code&gt;codens.red.create_bug_report&lt;/code&gt; style), but flat names render cleaner in tool-use traces and grep better in logs. Worth it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Shared client code
&lt;/h3&gt;

&lt;p&gt;All five product clients live in one place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/codens_mcp/client/
  auth.py
  blue.py
  green.py
  red.py
  auth_helper.py    # JWT load/refresh, shared
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the part that does not show up in the token count but matters for the maintenance story. Five separate MCP packages would mean five copies of &lt;code&gt;auth_helper.py&lt;/code&gt; drifting independently. One package means one bug fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Single auth flow
&lt;/h3&gt;

&lt;p&gt;Auth Codens is the SSO root for the family, so the MCP server only ever speaks one login dialect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codens-mcp login        &lt;span class="c"&gt;# Device Code Flow, runs once&lt;/span&gt;
&lt;span class="c"&gt;# token persisted to ~/.purple-codens/credentials.json&lt;/span&gt;
&lt;span class="c"&gt;# every product client reads the same file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The historical path is &lt;code&gt;~/.purple-codens/credentials.json&lt;/code&gt; because Purple shipped first and we did not want to break existing users by renaming. Cosmetic debt, zero functional cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Re-export pattern for Purple
&lt;/h3&gt;

&lt;p&gt;This is the move that kept us honest. Purple already had a standalone MCP package on PyPI (&lt;code&gt;purple-codens-mcp&lt;/code&gt;) before the unified server existed. We did not fork it. The unified package imports and re-registers Purple's tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/codens_mcp/tools/purple_tools.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;purple_codens_mcp.tools.project_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register_project_tools&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;purple_codens_mcp.tools.repo_tools&lt;/span&gt;    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register_repo_tools&lt;/span&gt;
&lt;span class="c1"&gt;# ...four more imports
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;register_purple_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;_register_purple_auth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_purple_get_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;_register_projects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_purple_get_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;_register_repos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_purple_get_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Existing users of &lt;code&gt;purple-codens-mcp&lt;/code&gt; on PyPI keep working unchanged. &lt;code&gt;codens-mcp&lt;/code&gt; adds Red, Blue, Green, Auth, and Cross on top. One package can be fully replaced by the other without breaking anyone, which gave us a safe rollout.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Lazy execution
&lt;/h3&gt;

&lt;p&gt;The 4,720 tokens is the registration cost. Claude Code sees all 31 tool descriptors at startup. Each tool's actual HTTP call only fires on invocation, and the per-call response is bounded by the tool's own prompt (usually a few hundred tokens of JSON). The thing that scales linearly with use is the conversation transcript, not the registration. Bloat at startup is the lever; we pulled it once, and the rest of the session is unaffected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest tradeoffs
&lt;/h2&gt;

&lt;p&gt;Unified is not free. Three things we gave up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One process is one failure mode.&lt;/strong&gt; If &lt;code&gt;codens-mcp&lt;/code&gt; crashes, all five product surfaces are gone simultaneously. With separate MCPs each product gets its own isolation boundary and a Red bug cannot take down Green tooling. We accepted this because we are a small shop, the package is small, and a crash in production would tell us we have a much bigger problem than tool routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update cadence is coupled.&lt;/strong&gt; Shipping a new Red tool means cutting a new version of the whole package. Users get every product's churn whether they wanted it or not. We considered semver-per-product subnamespacing and rejected it because our internal release cadence is already weekly and roughly synchronized; the imaginary user who wants Red on a daily cycle but Green frozen does not exist for us yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permission boundary is coarse at the MCP layer.&lt;/strong&gt; Authenticating once gives the user access to all 31 tools. You cannot tell Claude Code "allow Red but not Green" through the MCP descriptors alone. We solved this one level up: Auth Codens enforces role-based permissions on the server side, so even if the MCP exposes &lt;code&gt;green_create_kickoff&lt;/code&gt;, the API call rejects users who do not have the Green entitlement. The MCP becomes the surface; the gate lives elsewhere.&lt;/p&gt;

&lt;p&gt;"Unified is always right" is not the conclusion here. If you ship one MCP per oncall team and the teams release on different cycles, you are paying the token tax for a reason, and the isolation buys you something real. The unified shape worked for us because the products were already coupled.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the token bloat actually comes from
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://twitter.com/akshay_pachaar" rel="noopener noreferrer"&gt;Akshay's follow-up tweet&lt;/a&gt; closes the loop:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The protocol was never the bottleneck. The loading strategy was."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the line I want every MCP author to internalize. The 55K-token figure is not what MCP-the-spec costs. It is what N separate handshakes plus N capability advertisements plus N redundant client preambles cost when you let your tools sprawl into N independent servers.&lt;/p&gt;

&lt;p&gt;Look at the math from the other direction. If five separate MCPs each carry a 10–15K registration footprint (one server's worth of capability JSON, instructions, schema bundles), you are at 50–75K before the model has done anything useful. Collapse the five servers to one and the registration overhead collapses too, because there is only one capability list, one instruction blob, one schema bundle, and the per-tool descriptor cost is small.&lt;/p&gt;

&lt;p&gt;The protocol is doing its job. The protocol is also fine with you stacking five copies of itself in your config file, because that is a user choice, not a spec smell. Treating MCP servers like microservices ("one per product, for isolation") is the analogue of running 30 Lambda cold starts where one process would do.&lt;/p&gt;

&lt;p&gt;We did not invent a new transport. We did not strip schemas. We just stopped paying for five handshakes when one would do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle
&lt;/h2&gt;

&lt;p&gt;Partition your MCP surface by domain, not by tool class. If five tools share an auth root, a release cadence, and a user mental model, they belong in one server. If they do not, split. The token cost is a downstream signal of how well that partition matches reality.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;codens-mcp&lt;/code&gt; is on PyPI: &lt;code&gt;pip install codens-mcp&lt;/code&gt;. Code lives at &lt;a href="https://github.com/codens-ai" rel="noopener noreferrer"&gt;github.com/codens-ai&lt;/a&gt;. If you want the user-facing pitch, that is at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;codens.ai/en&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>claude</category>
      <category>python</category>
      <category>architecture</category>
    </item>
    <item>
      <title>"How one empty message poisoned an entire AI consultation (and the three-layer fix)"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Mon, 11 May 2026 05:33:11 +0000</pubDate>
      <link>https://dev.to/zoetaka38/how-one-empty-message-poisoned-an-entire-ai-consultation-and-the-three-layer-fix-57fb</link>
      <guid>https://dev.to/zoetaka38/how-one-empty-message-poisoned-an-entire-ai-consultation-and-the-three-layer-fix-57fb</guid>
      <description>&lt;p&gt;A user opened a support thread saying their AI consultation had gone unresponsive. Every message they sent came back with an error. Refreshing didn't help. Starting a new tab didn't help. From their side, the conversation was dead.&lt;/p&gt;

&lt;p&gt;The product is Codens Green, a PRD management tool where users hold long, iterative conversations with Claude to refine product requirements. Some of those conversations run dozens of turns. This particular one had thirty-something messages of history, all looking normal in the database. The row was there. The user was authenticated. The organization had credits. And yet every new message hit the API and bounced.&lt;/p&gt;

&lt;p&gt;By the time we shipped the fix it was three layers deep, and only one of those layers is the "actual" fix. The other two were the kind of belt-and-suspenders you only put on once you've been burned. I want to walk through what we saw, what we tried first (which was wrong), what the real cause turned out to be, and the shape of the patch.&lt;/p&gt;

&lt;h2&gt;
  
  
  What 400 BadRequest looked like
&lt;/h2&gt;

&lt;p&gt;The backend log for the failing consultation looked like this on every request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;ERROR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Failed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;generate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;AI&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;response:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;code:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;'type':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'invalid_request_error'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="err"&gt;'message':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'messages.&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;content&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;blocks&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;must&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;non-empty'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same error, same index, every time. The user retried, our code retried, the error didn't move. Index 17 was always index 17 because index 17 was sitting in their stored history.&lt;/p&gt;

&lt;p&gt;I went down the wrong path first. The error code was 400, which felt like an auth-shaped problem, so I started there. Wrong key? The key was fine, every other org was working. Rate limit? No, this org wasn't anywhere close. Model deprecation? We were on a current model, and other consultations using the exact same model were responding normally. I checked the Anthropic status page. Green across the board. I checked our own credit-deduction logic to make sure we weren't somehow short-circuiting requests. Clean.&lt;/p&gt;

&lt;p&gt;About forty minutes in I noticed the &lt;code&gt;messages.17&lt;/code&gt; part of the error and felt stupid. The API was telling me exactly which message in the array it didn't like. I just hadn't read it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real cause
&lt;/h2&gt;

&lt;p&gt;I pulled the consultation row, parsed its &lt;code&gt;messages&lt;/code&gt; JSON, and walked it. Most messages had a few hundred characters of content. Message 17, an assistant message, had &lt;code&gt;content: ""&lt;/code&gt;. Empty string. Not whitespace, not null, just empty.&lt;/p&gt;

&lt;p&gt;Claude's API rejects requests where any message in the &lt;code&gt;messages&lt;/code&gt; array has empty content. That's a hard validation at the boundary, not a soft failure. Which meant: the moment that empty message landed in the consultation's history, every future call was guaranteed to fail, because every future call assembled the full history and sent it back to the API. The conversation had been poisoned by one row.&lt;/p&gt;

&lt;p&gt;The user couldn't recover from inside the app. Our UI didn't expose a "delete message" affordance for this surface, and even if it did, the broken message was an assistant turn, not theirs to edit. From the user's perspective, the consultation just stopped working. Forever. With no error message that meant anything to them.&lt;/p&gt;

&lt;p&gt;This is the worst kind of bug. It only surfaces for users with enough history to have triggered the rare condition that produced the bad row, the dashboards don't flag it (a 400 from Claude looks like an intermittent upstream failure if you don't drill in), and the root cause is invisible because it happened on some earlier request you weren't watching.&lt;/p&gt;

&lt;h2&gt;
  
  
  How an empty assistant message ever got saved
&lt;/h2&gt;

&lt;p&gt;Once I knew what to look for, the chain was straightforward.&lt;/p&gt;

&lt;p&gt;Claude's API occasionally returns a response where the assistant's &lt;code&gt;text_content&lt;/code&gt; is empty. I don't have a great theory for why. Could be transient, could be an edge case in their content filtering, could be a race in how we parse &lt;code&gt;content&lt;/code&gt; blocks when the response has tool-use blocks but no text blocks. It's rare. I'd guess less than one in ten thousand calls in our traffic. But across enough users and enough turns, "rare" becomes "guaranteed."&lt;/p&gt;

&lt;p&gt;Our previous code did approximately this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ai_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_claude_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_consultation_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;consultation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;consultation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ai_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="n"&gt;consultation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_assistant_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ai_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ai_metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ai_response&lt;/code&gt; could be &lt;code&gt;""&lt;/code&gt;. Nothing checked. The empty string flowed into &lt;code&gt;add_assistant_message&lt;/code&gt;, got appended to the message list, and the entity got persisted. From that point forward, the consultation was permanently broken.&lt;/p&gt;

&lt;p&gt;One unchecked write, two days earlier, became a permanent block on the user's account.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three-layer fix
&lt;/h2&gt;

&lt;p&gt;The patch split into three layers. Each one defends a different boundary, and only the middle one is what I'd call the real fix. The other two are there because the real fix doesn't help users who already have a poisoned row, and because I wanted to bound the failure surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: filter on the way out
&lt;/h3&gt;

&lt;p&gt;In the &lt;code&gt;Consultation&lt;/code&gt; domain entity, &lt;code&gt;get_messages_for_ai()&lt;/code&gt; is what assembles the array we send to Claude. The old version included every non-system message. The new version also excludes anything with empty or whitespace-only content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_messages_for_ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;MessageRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SYSTEM&lt;/span&gt;
        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the layer that unsticks every existing poisoned consultation. We didn't run a data migration. We didn't write a one-shot cleanup script. The filter at read time simply skips the bad row on the way to the API, and the conversation works again. The bad row is still sitting in the DB, but it's never sent anywhere that would reject it.&lt;/p&gt;

&lt;p&gt;I want to be honest about what this layer is and isn't. It's defensive. It papers over bad data. It does not prevent the bug from happening again. If you only ship this layer, you keep generating empty rows and keep skipping them, which is fine until something else relies on the history being complete (PRD generation from conversation summary, for instance) and now the user's PRD is missing a turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: detect on the way in
&lt;/h3&gt;

&lt;p&gt;This is the real fix. In our Claude client wrapper, &lt;code&gt;generate_consultation_response()&lt;/code&gt; now refuses to return an empty response at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;text_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No text content in Claude API response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Claude hands us back a response with no text blocks (or only empty text blocks), we raise. The caller in &lt;code&gt;AddMessageUseCase&lt;/code&gt; already has a try/except around the API call and falls back to a generic "sorry, please try again" message. Crucially, that fallback message goes to the user as a transient response. It does not get persisted as an assistant turn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;consultation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_messages_for_ai&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ai_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_claude_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_consultation_response&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="n"&gt;ai_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to generate AI response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ai_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;申し訳ありません。AIからの応答の生成中にエラーが発生しました。...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait, that's not quite right as stated. Look at the existing code and you'll see the fallback message does get persisted via &lt;code&gt;add_assistant_message&lt;/code&gt; further down. That's a separate concern we'll come back to. What matters here is that with Layer 2 in place, the assistant message that gets stored on a failed call is either real text or our explicit, non-empty fallback string. It is never &lt;code&gt;""&lt;/code&gt;. The DB cannot accumulate another poisoned row from this code path.&lt;/p&gt;

&lt;p&gt;If you can only ship one of the three layers, ship this one. Defending at the output boundary, the moment data crosses from "external API response" into "thing we persist," is where bad data deserves to die. Filtering at read time is a workaround. Validating at write time is the fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: bound the history
&lt;/h3&gt;

&lt;p&gt;This one is technically a separate bug, but I shipped it in the same PR series because the user-visible symptom overlaps. Long consultations were starting to push against the context window, and a few users were seeing failures that looked similar (intermittent API errors on long-running conversations) but had a different cause.&lt;/p&gt;

&lt;p&gt;So in &lt;code&gt;AddMessageUseCase&lt;/code&gt;, we cap the history we send:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MAX_HISTORY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MAX_HISTORY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;MAX_HISTORY&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Forty messages is roughly twenty user/assistant turns. The trailing slice gets the most recent context, which is almost always what matters. The &lt;code&gt;while&lt;/code&gt; loop handles a Claude API requirement that conversations must start with a user role. If the slice happens to begin with an assistant message (because we truncated mid-turn), we drop the leading assistants until we find a user message.&lt;/p&gt;

&lt;p&gt;Three things to flag about Layer 3. First, twenty turns is a product choice, not a technical limit; we picked it because our consultation UI doesn't show more than that comfortably anyway, and longer histories were producing diminishing returns on AI quality. Second, the first-user-role correction is a Claude-specific constraint. Don't carry this verbatim to a different provider without checking their docs. Third, this layer is unrelated to the empty-message bug. It's bundled in because the failure mode looks adjacent from a triage perspective, and shipping them together meant one round of regression testing instead of two.&lt;/p&gt;

&lt;h2&gt;
  
  
  The migration we didn't write
&lt;/h2&gt;

&lt;p&gt;One thing I want to underline. Layer 1, the read-time filter, accidentally did the work of a data migration without being a data migration. Every existing poisoned consultation in our DB started working again the moment the deploy went out. No SQL to write, no rows to update, no offline job to run. The defensive layer absorbed the historical damage.&lt;/p&gt;

&lt;p&gt;That's not always the right tradeoff. If we'd needed downstream consumers (analytics, PRD generation, exports) to see a complete history, leaving bad rows in place would have leaked into those features later. In our case the only consumer that read the bad message was the call to Claude itself, so filtering at read time was sufficient. But it's worth naming the pattern explicitly: a defensive read-side filter can serve as a zero-downtime migration for a class of bad data, as long as you're confident you've enumerated every reader.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd take away
&lt;/h2&gt;

&lt;p&gt;The thing I keep coming back to is that the cause of the user's problem (one empty cell, written two days earlier, somewhere on the request path) had nothing visible in common with the symptom they were experiencing (every new message fails with a 400 today). The signal that mattered was buried in the error message itself, and I spent forty minutes chasing API keys before I read it. Read the error.&lt;/p&gt;

&lt;p&gt;The three-layer shape, defend on the way in, defend on the way out, bound the size, is general. It works for any case where you're persisting outputs from an external API and replaying them as inputs. Validate before you persist. Filter before you replay. Cap the surface.&lt;/p&gt;

&lt;p&gt;If you're building anything with Claude, &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;Codens&lt;/a&gt; is what we use this same stack to build.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>python</category>
      <category>debugging</category>
    </item>
    <item>
      <title>"Persisting your real Chrome login across Playwright restarts on macOS"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Sun, 10 May 2026 02:08:29 +0000</pubDate>
      <link>https://dev.to/zoetaka38/persisting-your-real-chrome-login-across-playwright-restarts-on-macos-126a</link>
      <guid>https://dev.to/zoetaka38/persisting-your-real-chrome-login-across-playwright-restarts-on-macos-126a</guid>
      <description>&lt;p&gt;Every macOS reboot, the same ritual. Open the Playwright-controlled Chrome window, see seven publishing tabs all logged out, and spend the next ten minutes typing passwords and tapping the Google account picker. Zenn, dev.to, note, Substack, X, LinkedIn, the Google Search Console dashboard. All gone, all needing the same Google SSO dance through my corevice.com workspace account.&lt;/p&gt;

&lt;p&gt;I run a one-person GTM operation for Codens and the publishing pipeline is entirely Playwright-driven. &lt;code&gt;npx @playwright/cli@latest&lt;/code&gt; opens a real Chrome with a persistent profile, and a stack of small scripts paste titles and bodies into each editor. It works beautifully until the host reboots and the user-data-dir at &lt;code&gt;/tmp/chrome-pw-corevice&lt;/code&gt; evaporates with the rest of &lt;code&gt;/tmp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I finally sat down and fixed it. The result is a thirty-line shell script that clones my daily-driver Chrome profile into the Playwright tmpdir on every launch, with two non-obvious tricks that make the cookies actually decrypt. This post is about those two tricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the obvious copy doesn't work
&lt;/h2&gt;

&lt;p&gt;The first thing anyone tries is the obvious thing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; ~/Library/Application&lt;span class="se"&gt;\ &lt;/span&gt;Support/Google/Chrome/Default &lt;span class="se"&gt;\&lt;/span&gt;
      /tmp/chrome-pw-corevice/Default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it, fire up Playwright, and Chrome opens looking like it has my profile. History is there. Bookmarks are there. Extensions are there. But every site is logged out, and the cookie jar in DevTools is either empty or full of cookies that don't authenticate anything.&lt;/p&gt;

&lt;p&gt;The reason is that Playwright launches Chrome with two flags I didn't know about until I started digging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--use-mock-keychain&lt;/span&gt;
&lt;span class="nt"&gt;--password-store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;basic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those flags tell Chrome to bypass the macOS Keychain entirely and use a hardcoded mock encryption key for cookies and the password store. From Playwright's point of view this is the right default. CI runners don't have a real keychain. Headless containers don't have a real keychain. The mock makes Chrome boot reliably in places where Keychain Access doesn't exist.&lt;/p&gt;

&lt;p&gt;But for me, this is exactly wrong. The cookies my daily-driver Chrome wrote to disk were encrypted with the real keychain key, the one Chrome stored under "Chrome Safe Storage" in my login keychain on first install. The cookies that just got copied over are still encrypted with that real key. Playwright's Chrome boots with the mock key, tries to decrypt them, gets garbage, and silently treats every cookie as invalid.&lt;/p&gt;

&lt;p&gt;I tried &lt;code&gt;storageState&lt;/code&gt; first, which is the documented Playwright path for this. Export cookies and localStorage from one context, inject into another. It works for some sites and dies for others. Substack stalled at the Google SSO redirect and never finished the auth handshake. Note's editor wanted a CSRF token tied to a session cookie that storageState had captured but which the server no longer accepted, presumably because the session was bound to the original UA fingerprint. After the third site failed in a different way I gave up on storageState and went back to cloning the whole profile.&lt;/p&gt;

&lt;p&gt;So: two real fixes are needed. Make Playwright's Chrome speak the same encryption language as my daily Chrome, and copy the cookie database in a way that doesn't corrupt it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix one, patch the keychain flag
&lt;/h2&gt;

&lt;p&gt;Playwright's CLI bundles its Chrome launch arguments inside &lt;code&gt;playwright-core/lib/coreBundle.js&lt;/code&gt;. When you run &lt;code&gt;npx @playwright/cli@latest&lt;/code&gt;, npm caches that file under &lt;code&gt;~/.npm/_npx/&amp;lt;hash&amp;gt;/node_modules/playwright-core/lib/coreBundle.js&lt;/code&gt;. The file is huge and minified, but the two strings I care about appear verbatim:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="s2"&gt;"--use-mock-keychain"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;"--password-store=basic"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;sed&lt;/code&gt; rewrite is enough. Swap the first to &lt;code&gt;--use-real-keychain&lt;/code&gt; and the second to &lt;code&gt;--password-store=keychain&lt;/code&gt;. Chrome on macOS recognizes both, and once they're in place the launched Chrome reads its encryption key from the same login keychain entry as my daily-driver Chrome. The cookies decrypt. SSO holds.&lt;/p&gt;

&lt;p&gt;The patch wants to be idempotent because npx happily re-extracts the package if it gets purged from the cache, and I don't want to re-edit the file by hand each time. So the script does three things. It locates the bundle with &lt;code&gt;find&lt;/code&gt;. It checks whether the bundle still contains &lt;code&gt;--use-mock-keychain&lt;/code&gt;, which means it hasn't been patched yet. If so, it makes a &lt;code&gt;.bak&lt;/code&gt; copy on first patch and runs &lt;code&gt;sed -i ''&lt;/code&gt; in place.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;.bak&lt;/code&gt; is the escape hatch. If a future Playwright update changes those flags or relies on the mock keychain elsewhere and my patch breaks something, I can &lt;code&gt;mv coreBundle.js.bak coreBundle.js&lt;/code&gt; and be back to stock in one command.&lt;/p&gt;

&lt;p&gt;The first time you launch the patched Chrome, macOS will pop a Keychain Access dialog asking you to allow access to "Chrome Safe Storage." Click Always Allow. After that, no prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix two, SQLite backup for the cookie file
&lt;/h2&gt;

&lt;p&gt;With the keychain flag patched, the next failure mode is more subtle. Sometimes the cookies decrypt, sometimes they don't, and when they don't, the SQLite file looks corrupt. Chrome refuses to read it and silently starts a fresh empty cookie jar.&lt;/p&gt;

&lt;p&gt;Chrome's &lt;code&gt;Cookies&lt;/code&gt; file is a SQLite database. My daily-driver Chrome is almost always running, which means it's holding write locks on that database, and depending on timing it may have a partial write in progress when &lt;code&gt;cp&lt;/code&gt; reads the file. The result is a torn copy: the bytes are physically there, but the SQLite page checksums don't match the WAL log, and SQLite refuses to open it.&lt;/p&gt;

&lt;p&gt;The right tool for snapshotting a live SQLite database is the &lt;code&gt;.backup&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sqlite3 &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SOURCE_PROFILE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Default/Cookies"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;".backup &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Default/Cookies"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't just a smarter copy. It uses SQLite's online backup API, which acquires a read lock, copies pages in a way that's transactionally consistent with the source database's current state, and produces a target file that opens cleanly. You can run it while Chrome is actively writing to the source. The output is always a valid database.&lt;/p&gt;

&lt;p&gt;The script removes the stale &lt;code&gt;Cookies&lt;/code&gt; and &lt;code&gt;Cookies-journal&lt;/code&gt; files first, then runs &lt;code&gt;.backup&lt;/code&gt; on every launch. That way the cookie jar is always fresh, even if I haven't rebooted but I have used my daily Chrome to log into a new site since the last Playwright session.&lt;/p&gt;

&lt;h2&gt;
  
  
  The script
&lt;/h2&gt;

&lt;p&gt;The whole thing is at &lt;code&gt;runbooks/launch/playwright-launch.sh&lt;/code&gt; in my GTM repo. Roughly thirty lines if you don't count comments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="c"&gt;# Launch playwright-cli against the corevice.com Chrome profile copy.&lt;/span&gt;
&lt;span class="c"&gt;# Idempotent: if profile copy missing, re-creates it; if patch missing, re-applies.&lt;/span&gt;
&lt;span class="c"&gt;# Usage: ./playwright-launch.sh open &amp;lt;url&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;#        ./playwright-launch.sh &amp;lt;command&amp;gt; [args...]&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;PW_CACHE_BASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/.npm/_npx"&lt;/span&gt;
&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/tmp/chrome-pw-corevice"&lt;/span&gt;
&lt;span class="nv"&gt;SOURCE_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Library/Application Support/Google/Chrome"&lt;/span&gt;

ensure_profile_copy&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Default"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Default/Cookies"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"[setup] copying profile..."&lt;/span&gt;
    &lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Default"&lt;/span&gt;
    rsync &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Cache'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Code Cache'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'GPUCache'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Service Worker'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'ShaderCache'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'GraphiteDawnCache'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'component_crx_cache'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'extensions_crx_cache'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Sessions'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'File System'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'blob_storage'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Cookies'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Cookies-journal'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SOURCE_PROFILE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Default/"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Default/"&lt;/span&gt;
    &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SOURCE_PROFILE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Local State"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/"&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true
    cp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SOURCE_PROFILE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/First Run"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/"&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true
  &lt;/span&gt;&lt;span class="k"&gt;fi&lt;/span&gt;

  &lt;span class="c"&gt;# Always refresh cookies via SQLite .backup (safe with Chrome running)&lt;/span&gt;
  &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Default/Cookies"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Default/Cookies-journal"&lt;/span&gt;
  sqlite3 &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SOURCE_PROFILE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Default/Cookies"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;".backup &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Default/Cookies"&lt;/span&gt; 2&amp;gt;/dev/null
&lt;span class="o"&gt;}&lt;/span&gt;

ensure_patch&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;cb
  &lt;span class="nv"&gt;cb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_CACHE_BASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-path&lt;/span&gt; &lt;span class="s1"&gt;'*playwright-core/lib/coreBundle.js'&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$cb&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"[setup] @playwright/cli not yet installed; npx -y will install it"&lt;/span&gt;
    &lt;span class="k"&gt;return
  fi
  if &lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="s1"&gt;'--use-mock-keychain'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$cb&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"[setup] patching playwright to use real keychain..."&lt;/span&gt;
    &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;cb&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.bak"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$cb&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;cb&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.bak"&lt;/span&gt;
    &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s|"--password-store=basic"|"--password-store=keychain"|'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s|"--use-mock-keychain",|"--use-real-keychain",|'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$cb&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

ensure_profile_copy
ensure_patch

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/.asdf/shims:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="k"&gt;:-}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"open"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;shift
  exec &lt;/span&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @playwright/cli@latest open &lt;span class="nt"&gt;--headed&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--browser&lt;/span&gt; chrome &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @playwright/cli@latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--browser&lt;/span&gt; chrome &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PW_PROFILE_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few notes on the rsync exclude list. All the cache directories are excluded because they're large, regenerable, and sometimes hold OS-specific binary blobs that Chrome will rebuild on first launch. &lt;code&gt;Sessions&lt;/code&gt; is excluded so Playwright's Chrome doesn't try to restore tabs from my daily browsing. &lt;code&gt;File System&lt;/code&gt; and &lt;code&gt;blob_storage&lt;/code&gt; are excluded for size. &lt;code&gt;Cookies&lt;/code&gt; and &lt;code&gt;Cookies-journal&lt;/code&gt; are excluded specifically because we handle them via &lt;code&gt;.backup&lt;/code&gt; immediately after the rsync, and we want that to be the authoritative copy.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Local State&lt;/code&gt; and &lt;code&gt;First Run&lt;/code&gt; are copied separately. &lt;code&gt;Local State&lt;/code&gt; is where Chrome stores the encrypted master key reference and a few profile-level settings. &lt;code&gt;First Run&lt;/code&gt; is a sentinel file that suppresses the first-run wizard.&lt;/p&gt;

&lt;p&gt;The patched diff itself is two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- "--use-mock-keychain",
&lt;/span&gt;&lt;span class="gi"&gt;+ "--use-real-keychain",
&lt;/span&gt;&lt;span class="gd"&gt;- "--password-store=basic"
&lt;/span&gt;&lt;span class="gi"&gt;+ "--password-store=keychain"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole keychain fix. Two strings.&lt;/p&gt;

&lt;h2&gt;
  
  
  A side issue, headless mode breaks the clipboard
&lt;/h2&gt;

&lt;p&gt;The script forces &lt;code&gt;--headed&lt;/code&gt; for the &lt;code&gt;open&lt;/code&gt; subcommand, and there's a story behind that. My publish scripts work by &lt;code&gt;pbcopy&lt;/code&gt;-ing the title and body into the clipboard, focusing the editor field via Playwright, and then sending Cmd+V. CodeMirror, Substack's editor, dev.to's editor — they all behave better with a real paste than with &lt;code&gt;type()&lt;/code&gt; calls that fire individual keypress events. Markdown formatting survives. Code blocks stay intact. Smart-quote autocorrect doesn't fire.&lt;/p&gt;

&lt;p&gt;But headless Chromium doesn't have a system clipboard. &lt;code&gt;navigator.clipboard.readText()&lt;/code&gt; returns empty, the paste handler sees no data, and the form silently stays empty. I lost an hour to that one before realizing the &lt;code&gt;open&lt;/code&gt; command was defaulting to headless mode in the version of &lt;code&gt;@playwright/cli&lt;/code&gt; I was on. Forcing &lt;code&gt;--headed&lt;/code&gt; makes the daemon run as a real Chrome window with full clipboard access, which is what I want anyway because I sometimes want to glance at the publish flow while it's running.&lt;/p&gt;

&lt;p&gt;The non-&lt;code&gt;open&lt;/code&gt; commands pass through unchanged, so anything else that wants headless behavior still gets it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it's worth
&lt;/h2&gt;

&lt;p&gt;Five to ten minutes of manual relogins per reboot, multiplied by however often macOS decides to update overnight. Across a year that's hours I get back, and more importantly the publish scripts now run unattended. I push a draft, the script opens the right tab, pastes the right content, and I review the rendered preview before clicking publish.&lt;/p&gt;

&lt;p&gt;If you're running a similar setup, the script is generic. Change &lt;code&gt;SOURCE_PROFILE&lt;/code&gt; if you use a non-Default Chrome profile, change &lt;code&gt;PW_PROFILE_DIR&lt;/code&gt; if you don't trust &lt;code&gt;/tmp&lt;/code&gt; to survive your reboot policy, and the rest should work.&lt;/p&gt;

&lt;p&gt;This is the kind of small infrastructure work that makes solo operations possible. We build a lot of these at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;Codens&lt;/a&gt;, where the day job is wiring AI agents into the same kind of publishing and dev pipelines.&lt;/p&gt;

</description>
      <category>playwright</category>
      <category>chrome</category>
      <category>macos</category>
      <category>automation</category>
    </item>
    <item>
      <title>"Why your long-running AI agent feels broken (even when it isn't)"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Fri, 08 May 2026 04:06:22 +0000</pubDate>
      <link>https://dev.to/zoetaka38/why-your-long-running-ai-agent-feels-broken-even-when-it-isnt-252j</link>
      <guid>https://dev.to/zoetaka38/why-your-long-running-ai-agent-feels-broken-even-when-it-isnt-252j</guid>
      <description>&lt;p&gt;A support ticket came in last month with the subject line "the plan generator is broken." It was not, in fact, broken. The Celery task was running. The downstream service had accepted the job. The database row was sitting there with &lt;code&gt;generation_status = 'in_progress'&lt;/code&gt; exactly as designed. From the server's point of view, the system was healthy.&lt;/p&gt;

&lt;p&gt;From the user's point of view, they had clicked a button fifteen minutes ago and nothing had happened since.&lt;/p&gt;

&lt;p&gt;I run Codens, a small AI dev harness, mostly solo. We have a product called Green Codens that turns Product Requirements Documents into actionable dev plans. The plan generation is a long-running AI job. It can take 30 seconds for a tiny repo or 30+ minutes for a sprawling one. We had built two completion paths: a webhook for the happy case and a polling fallback for when the webhook missed. The webhook had silently failed during a deploy. The polling fallback was scheduled to make its first call fifteen minutes after submission.&lt;/p&gt;

&lt;p&gt;We changed two numbers. The same workflow now feels roughly fifteen times faster. Total compute is basically unchanged. This post is about why those two numbers mattered so much, and what they imply about designing async UX in AI products in general.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually happening
&lt;/h2&gt;

&lt;p&gt;Green Codens does the PRD authoring side. A separate service we call Purple Codens does the heavier lifting: cloning the repo, reading code, running an analysis agent, producing a structured task list. When a user converts a PRD into a dev plan, Green submits an analyze job to Purple, gets a &lt;code&gt;202 Accepted&lt;/code&gt; and a job id back, and then has to wait for the result.&lt;/p&gt;

&lt;p&gt;There are two completion paths.&lt;/p&gt;

&lt;p&gt;The first is a webhook, which is just the server-to-server "I'm done" callback. When Purple finishes, it POSTs the result back to Green with a signature, and Green applies it to the plan row. This is the happy path and it usually works.&lt;/p&gt;

&lt;p&gt;The second path is a polling fallback. Webhooks miss for boring reasons. A receiver might be mid-deploy and bouncing 503s for thirty seconds. A signing key rotation might leave one side temporarily unable to verify the other. A network blip might drop the request and the sender's retry policy might give up before the receiver is back. None of these are exotic. All of them happen in real production systems. So Green also runs a Celery task that wakes up periodically, asks Purple "hey, what's the status of job X?", and applies the result if the job is done.&lt;/p&gt;

&lt;p&gt;The polling task is idempotent. If the webhook already applied the result, the polling task sees &lt;code&gt;generation_status = 'completed'&lt;/code&gt; and is a no-op. If the webhook missed, the polling task is the safety net that catches the dropped result.&lt;/p&gt;

&lt;p&gt;Here is what the original schedule looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Original (bad)
&lt;/span&gt;&lt;span class="n"&gt;_INITIAL_COUNTDOWN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;900&lt;/span&gt;  &lt;span class="c1"&gt;# wait 15 minutes before first poll
&lt;/span&gt;&lt;span class="n"&gt;_RETRY_COUNTDOWN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;    &lt;span class="c1"&gt;# then poll every 5 minutes, up to 12 times
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total polling window: 15 + 12 × 5 = 75 minutes. The reasoning was server-side and superficially sensible. Most analyses on real customer repos finish somewhere in the 10 to 20 minute range. Polling earlier than 15 minutes "wastes" API calls on jobs that are obviously still running. Polite. Considerate. Reasonable in isolation.&lt;/p&gt;

&lt;p&gt;The problem was that the user does not live in the server's frame of reference. The user clicks the button, sees a "your plan is being analyzed..." spinner, and then the front-end is silent. If the webhook fires, great, the spinner becomes a result. If the webhook does not fire, the user sits with that spinner for a full fifteen minutes before any other code path even tries to discover the truth. They reload the page. They check the network tab. They contact us. By the time the polling fallback fires its first request, the user has already decided we are broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  The retry-design trap
&lt;/h2&gt;

&lt;p&gt;When you reach for retry logic in any system, the default mental model most engineers grab is "start short, double each time, give up at some bound." If you have ever written &lt;code&gt;time.sleep(2 ** attempt)&lt;/code&gt; you have used it. It is taught early, it appears in HTTP client libraries, it ships in AWS SDKs by default. It is the right answer to a real problem.&lt;/p&gt;

&lt;p&gt;But it is the right answer to a specific problem: you are calling something that is probably failing, and you do not want to hammer it while it is on fire. Each retry is a fresh attempt at the same operation. You assume the remote side might be temporarily unable to serve you, you give it space to recover, and you increase the wait between attempts so that if the outage is long, you are not piling on. The pattern protects the server from you.&lt;/p&gt;

&lt;p&gt;The polling fallback in Green is doing something different. The job we are checking on is, in the overwhelming majority of cases, completely healthy. It started running a few minutes ago. It is going to finish on its own. The only reason we are polling at all is to catch the rare case where Purple finished, told us about it, and the message did not get through. We are not retrying a failing call. We are scanning for a missed event.&lt;/p&gt;

&lt;p&gt;Once you frame it that way, the standard retry shape becomes obviously wrong. Starting short and lengthening makes sense when "short" means "give the failing thing a moment to recover." That is not what we are doing. We are saying "did the message arrive yet?" There is no recovery happening on the other side, because the other side is fine. Waiting longer between checks does not help anyone. It just delays the moment we notice the missed message.&lt;/p&gt;

&lt;p&gt;If you stay with the standard shape and just shorten the initial wait, you end up over-polling at the tail. A job that legitimately takes 35 minutes does not need someone tapping it on the shoulder every 30 seconds for the back half of its run. That actually does spend API calls and Celery worker capacity for no information gain.&lt;/p&gt;

&lt;p&gt;The shape we wanted was something the standard pattern does not provide a good vocabulary for. Aggressive at the start. Calmer at the end. Inverted from the usual instinct. Every framing I tried for it (front-loaded, decaying, head-heavy) sounded jargony and made the actual idea harder to talk about than it deserved. So I will skip the label entirely and just describe the shape.&lt;/p&gt;

&lt;p&gt;We want the first poll within roughly a minute of submitting the job, because the cost of a missed webhook is measured in the user's emotional clock. We want a tight cluster of polls in the first five minutes, because that is the window in which essentially every kind of webhook failure manifests. Then we want to space out, because once you are ten minutes into a healthy job, the user has already accepted that this is going to take a while, and quick polling buys nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers after the change
&lt;/h2&gt;

&lt;p&gt;Here is the new schedule, lifted from &lt;code&gt;poll_purple_analyze_job.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Polling window = 60s initial + sum(_RETRY_BACKOFFS) ≈ 73 min total.
# Front-loaded so a missed webhook is noticed within ~2 minutes.
&lt;/span&gt;&lt;span class="n"&gt;_RETRY_BACKOFFS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;240&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;480&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;480&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;480&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;480&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;480&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;480&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;480&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;480&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;_MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_RETRY_BACKOFFS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The submitting task schedules the first poll with &lt;code&gt;countdown=60&lt;/code&gt; instead of &lt;code&gt;countdown=900&lt;/code&gt;. Each retry uses the next entry in the array as its countdown. Once the array is exhausted, the task gives up and marks the plan as failed so the UI can exit the loading state.&lt;/p&gt;

&lt;p&gt;The total budget is almost identical to the old design. Old: 15 + 12 × 5 = 75 minutes. New: 1 + 1 + 2 + 4 + (8 × 8) = 72 minutes. Both cover the long tail of legitimately long analyses with room to spare. Both stop somewhere around the 70-minute mark, which is where we have decided that further waiting is not actually going to produce a useful result and the right move is to surface the failure and let the user retry from the PRD page.&lt;/p&gt;

&lt;p&gt;What changed is the distribution of those minutes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to first poll&lt;/td&gt;
&lt;td&gt;15 min&lt;/td&gt;
&lt;td&gt;1 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worst-case missed-webhook detection&lt;/td&gt;
&lt;td&gt;15 min&lt;/td&gt;
&lt;td&gt;2 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Polls in the first 5 minutes&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Polls in the first 10 minutes&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total polling budget&lt;/td&gt;
&lt;td&gt;75 min&lt;/td&gt;
&lt;td&gt;72 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Polls at the long tail (every interval)&lt;/td&gt;
&lt;td&gt;5 min&lt;/td&gt;
&lt;td&gt;8 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most important row in that table is the second one. Worst case detection went from fifteen minutes to two. That is a roughly 7.5× improvement in the time it takes the system to notice that a webhook went missing. For users who hit this path, that translates directly into how long they sit watching nothing happen.&lt;/p&gt;

&lt;p&gt;Why is two minutes the right ceiling for missed-webhook detection? It comes from looking at how webhook failures actually present in our environment. Configuration errors and signature mismatches surface on the very first request, because the verification step is deterministic and the same key is used every time. Network blips, deploy bounces, and 5xx storms are short-lived. We have never seen a webhook failure pattern in production that took more than a couple of minutes to show up. So if we have not heard back within the first five-ish minutes of polling, the failure is one of the loud, immediate kinds, and it is already in our logs. If the webhook does eventually arrive late, the polling task is idempotent and skips out as soon as it sees the plan resolved.&lt;/p&gt;

&lt;p&gt;Conversely, the long tail is where polite polling actually pays off. Once a job has been running for ten minutes and is still in &lt;code&gt;in_progress&lt;/code&gt;, you are probably looking at one of the genuinely slow analyses. Polling that every 30 seconds does nothing useful and just clutters logs. Eight-minute intervals at the tail give the job room to finish on its own and only check in occasionally.&lt;/p&gt;

&lt;p&gt;The dispatch in the submitting task is a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;poll_purple_analyze_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyze_job_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;analyze_job_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organization_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;purple_org_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;project_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;countdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# was 900
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single number, 900 to 60, is most of the user-facing improvement. The array reshape is what protects the server from the consequences.&lt;/p&gt;

&lt;h2&gt;
  
  
  The deeper lesson
&lt;/h2&gt;

&lt;p&gt;The thing I keep coming back to after this change is how much of "this product feels good" turns out to be set in the first sixty to ninety seconds of any long async operation.&lt;/p&gt;

&lt;p&gt;A user clicking "generate plan" is making a small bet. They believe, tentatively, that this is going to work. They are willing to wait. But they need the system to keep that belief warm, and the way you keep it warm is by giving them a sign of life early. It can be a progress bar that moves. It can be a status string that updates. It can be, in our case, a backend that quickly notices when something has gone wrong and surfaces the truth instead of letting the spinner spin.&lt;/p&gt;

&lt;p&gt;What the system absolutely cannot do is stay silent for fifteen minutes. By minute three the user has already started constructing a story about what is broken. By minute five they are looking for a way to cancel. By minute ten they have moved on and the next time they come back they will arrive expecting failure. Even if the webhook eventually fires at minute twelve and everything works, the experience has been spent.&lt;/p&gt;

&lt;p&gt;The original 15-minute initial wait was reasoning about the wrong thing. It was optimizing the API call profile against the modal completion time of the underlying job. That is a real number and it is a real consideration, but it is not the constraint that should drive the polling cadence. The constraint that should drive the polling cadence is "how long can the user sit in front of a silent screen before they conclude we are broken." For our users, that number is somewhere between 60 and 90 seconds. Past that, you are losing them.&lt;/p&gt;

&lt;p&gt;This generalizes. Any time you have a long-running async AI task, somewhere in the system there is a piece of code that decides how often the rest of the system asks "is it done yet." That code is a UX decision, not a backend decision. Treat it that way.&lt;/p&gt;

&lt;p&gt;The framing I now use when reviewing this kind of code is to separate two distinct questions and answer them separately. Question one: how quickly do we need to detect that the happy path failed? That governs the early polling cadence. The answer is almost always "faster than you think," because the happy path failing silently is the worst experience the system can produce. Question two: how patiently can we wait for the work to finish on its own? That governs the late polling cadence. The answer is usually "more patiently than you think," because once the user has accepted the wait, polling more often does not buy anyone anything.&lt;/p&gt;

&lt;p&gt;Server politeness is a real cost, and I do not want to pretend otherwise. Hammering an internal API every five seconds for an hour wastes capacity and clutters dashboards. But you weigh it against the perception cost. For a small B2B SaaS like ours, a single user concluding the product is broken and ghosting is far more expensive than any conceivable amount of well-bounded internal polling traffic. We are on a private API to our own service. The economics are not even close.&lt;/p&gt;

&lt;p&gt;We added a single line to our internal design checklist as a result of this work: "First poll inside 60 seconds." When we review any new long-running async flow, that line gets checked. If we are scheduling the first liveness check more than a minute after submission, we have to justify it explicitly, in writing, against the user-perception cost. So far we have not had a single case that survived that justification.&lt;/p&gt;

&lt;h2&gt;
  
  
  What else got fixed along the way
&lt;/h2&gt;

&lt;p&gt;A couple of things came along for the ride in the same PR, because once you start looking at one polling task you tend to notice the things around it.&lt;/p&gt;

&lt;p&gt;The polling task now has an explicit "give up" path that marks the plan as failed when the retry array is exhausted. The original code logged a warning and exited. The plan row stayed in &lt;code&gt;in_progress&lt;/code&gt; forever, which meant the UI loading state never resolved and the user could not even retry generation, because the front-end refused to start a new job while the previous one was supposedly still running. The fix is small but important: when retries hit the wall, write an explanatory error message to the plan, mark it failed, and publish a status-change event so the UI exits the spinner. The error message tells the user how long we waited and suggests retrying from the PRD detail page. It is also idempotent, so if the webhook arrives late and resolves the plan as completed, the giveup path sees &lt;code&gt;generation_status&lt;/code&gt; is no longer &lt;code&gt;in_progress&lt;/code&gt; and does nothing.&lt;/p&gt;

&lt;p&gt;We also added an admin recovery endpoint for the case where a plan does get stuck in some unexpected state, usually because of a bug we have not seen yet. It manually transitions a plan back to a state where the user can retry. This sits in our admin tools and is not user-facing, but it has been useful exactly twice in the month since we shipped it, both for cases that taught us about new failure modes we then fixed properly. Operational tools earn their keep.&lt;/p&gt;

&lt;p&gt;Neither of these changes was the headline of the PR. They were both downstream consequences of taking the polling task seriously enough to read it line by line. That is its own lesson. Polling tasks tend to be the bit of code nobody reads. They are scheduled once when the feature is built and then they quietly run forever. The next time you find yourself in a polling fallback that nobody has touched in months, it is worth half an hour of your time to read the whole thing and ask whether the cadence still matches what users actually need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap
&lt;/h2&gt;

&lt;p&gt;The principle, in one sentence: poll the way the user feels the product, not the way the server feels the load. Almost everything else falls out of that.&lt;/p&gt;

&lt;p&gt;If you want to see what the rest of Codens looks like, the English landing page is at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt; and our help docs (which include a lot more about how Green and Purple talk to each other) live at &lt;a href="https://help.codens.ai/en/" rel="noopener noreferrer"&gt;https://help.codens.ai/en/&lt;/a&gt;. The polling task discussed in this post lives in the open part of our backend; if you happen to spot a different case where this same trade-off applies, I would genuinely like to hear about it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>ux</category>
      <category>python</category>
    </item>
    <item>
      <title>"Killing the 5-MCP setup tax with one PyPI package and Device Code Flow"</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Thu, 07 May 2026 00:43:27 +0000</pubDate>
      <link>https://dev.to/zoetaka38/killing-the-5-mcp-setup-tax-with-one-pypi-package-and-device-code-flow-3hn4</link>
      <guid>https://dev.to/zoetaka38/killing-the-5-mcp-setup-tax-with-one-pypi-package-and-device-code-flow-3hn4</guid>
      <description>&lt;p&gt;A few weeks ago, onboarding a new user to Codens looked like this. They opened &lt;code&gt;.claude/settings.json&lt;/code&gt; and pasted five MCP server entries, one for each of our product surfaces. Then they ran five separate login commands. Five OAuth callbacks, five JWTs scattered across config files, five chances to typo a URL. The first time I walked someone through it on a call, they very politely said "this is a lot." They were right.&lt;/p&gt;

&lt;p&gt;We shipped &lt;code&gt;codens-mcp&lt;/code&gt; to fix it. One PyPI package, one MCP entry, one login. Thirty-one tools across Purple (the orchestrator), Red (auto-fix), Blue (E2E QA), Green (PRD), and Auth (shared SSO and billing). And because half our users live on remote dev boxes where browser-based OAuth is a pain, we added a Device Code Flow login that works over SSH, in containers, and inside GitHub Codespaces. This post is about the design choices, including the ones I'm still slightly unsure about.&lt;/p&gt;

&lt;p&gt;There's a particular kind of friction in a five-product onboarding that I want to name before we get into the architecture. It's not the time. The actual install commands take maybe four minutes. It's the suspicion. When you're being asked to paste five entries into a config file you don't fully trust yet, every step makes you wonder whether the product is going to be worth the setup. By the third login prompt, half of the users I watched gave up on configuring it correctly and just used Purple in isolation. The product loss from that wasn't a few users. It was the cross-product workflows that never got tried, because nobody ever saw them light up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why one package, not five
&lt;/h2&gt;

&lt;p&gt;Codens grew the same way most multi-product systems grow: by accident. Purple shipped first. It had 16 MCP tools and a CLI called &lt;code&gt;purple-codens-mcp&lt;/code&gt;. People liked it. Then Red got its own MCP surface for bug reports and fix plans. Then Blue for test generation. Then Green for PRD consultations. Then Auth for signup and pricing lookups.&lt;/p&gt;

&lt;p&gt;Each of those products had its own backend, its own JWT, its own MCP server entry. From a code-organization standpoint that was fine. From a user standpoint it was a mess. People don't want to install five packages from PyPI. They want to install one thing and get all the tools.&lt;/p&gt;

&lt;p&gt;We considered three approaches:&lt;/p&gt;

&lt;p&gt;The first was: keep five packages, add a meta-package that depends on all of them. Clean dependency graph, but users still need five MCP server entries because each package exposes its own stdio binary. That solves nothing.&lt;/p&gt;

&lt;p&gt;The second was: collapse everything into &lt;code&gt;purple-codens-mcp&lt;/code&gt; and rename the package later. Tempting, but &lt;code&gt;purple-codens-mcp&lt;/code&gt; already had a userbase pinning &lt;code&gt;&amp;gt;=X.Y.Z&lt;/code&gt; in their lockfiles. Adding 15 new tools to that package would have been a stealth API expansion and the name would have been wrong forever.&lt;/p&gt;

&lt;p&gt;The third option, which is what we shipped: a new package called &lt;code&gt;codens-mcp&lt;/code&gt; that re-exports Purple's 16 tools and registers Red/Blue/Green/Auth tools alongside them. The package lives in a sibling directory (&lt;code&gt;purple-codens/codens-mcp/&lt;/code&gt;) and declares &lt;code&gt;purple-codens-mcp&lt;/code&gt; as a runtime dependency. Users who already have &lt;code&gt;purple-codens-mcp&lt;/code&gt; in production keep it working. New users install one thing. The bundling cost is one extra dependency in &lt;code&gt;pip list&lt;/code&gt;, which nobody is going to notice.&lt;/p&gt;

&lt;p&gt;The thing I like about this layout is that it keeps the auth code in exactly one place. Login logic lives in &lt;code&gt;purple_codens_mcp.auth&lt;/code&gt;. The new package imports that module rather than copying it. If a future Auth Codens migration changes the OAuth flow, I fix it in one file. The duplication trap is real and we deliberately walked around it.&lt;/p&gt;

&lt;p&gt;A subtler benefit of the bundled approach: the four product surfaces share a credential helper. Each tool that calls Red, Blue, or Green does so via a &lt;code&gt;_red_client(api_url)&lt;/code&gt; style accessor that reads the same credentials file Purple uses. There's no per-product login state to reconcile. If a token gets refreshed, everyone sees the new value on the next call. Earlier in the design I had each product's tools tracking auth independently, and the corner cases around expired tokens were the kind of thing I'd debug at midnight. Sharing one credential dict made those bugs go away because they couldn't exist in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cross-product registration tool
&lt;/h2&gt;

&lt;p&gt;Once all the tools sit in one package, you can write tools that span products. The first one we shipped is &lt;code&gt;codens_register_project_unified&lt;/code&gt;. It takes a GitHub repo and registers it across Purple, Red, Blue, and Green in a single call.&lt;/p&gt;

&lt;p&gt;Here's the design tension: Purple, Red, Blue, and Green each have their own database and their own &lt;code&gt;/api/v1/projects&lt;/code&gt; endpoint. There's no two-phase commit across them. So what happens when you call this tool and Green's API is having a bad afternoon?&lt;/p&gt;

&lt;p&gt;The transactional answer would be: roll everything back, fail loudly. But "rolling back" a project creation is annoying because some of those backends fire off webhooks and Slack notifications on creation, and reversing those side effects is messy. Worse, a user who wants to retry shouldn't be punished by having to delete three half-created projects manually.&lt;/p&gt;

&lt;p&gt;So we went best-effort. The tool returns a dict like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purple_project_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prj_a1b2c3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_project_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prj_d4e5f6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blue_project_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;green_project_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prj_g7h8i9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;503 Service Unavailable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If three out of four succeed, you get three IDs and a clearly-labeled failure for the one that didn't. The user (or the LLM driving the tool) can re-run the call with &lt;code&gt;products=["blue"]&lt;/code&gt; to retry just the failed one. That &lt;code&gt;products&lt;/code&gt; parameter defaults to all four, but accepting a subset turns out to be useful in two other situations: when a customer doesn't pay for one of the products yet, and when an LLM is exploring and only wants to register on Purple before committing to the rest.&lt;/p&gt;

&lt;p&gt;The honesty of the &lt;code&gt;errors&lt;/code&gt; array matters. Earlier drafts of the tool tried to be clever and aggregate failures into a single string. That made it harder for an agent to programmatically decide what to retry. The list-of-dicts shape is uglier in a logfile but trivially correct to parse.&lt;/p&gt;

&lt;p&gt;There's still one edge case I'm not happy with. If Purple succeeds but the network drops before Red is called, the user has a Purple project ID and no record that the tool was even attempted on the others. The current answer is "rerun with &lt;code&gt;products=["red", "blue", "green"]&lt;/code&gt;," which works, but it relies on the user noticing. A better long-term answer is probably an idempotency key sent to all four backends so retries dedupe naturally. That's on the list.&lt;/p&gt;

&lt;p&gt;One thing the cross-product tool does well: it sets the same &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;github_owner&lt;/code&gt;, and &lt;code&gt;github_repo&lt;/code&gt; everywhere. That sounds trivial but it was a real source of bugs in the five-package world, where users would type the repo name slightly differently in each product's MCP tool and end up with &lt;code&gt;acme/my-app&lt;/code&gt; registered in Purple and &lt;code&gt;acme/my_app&lt;/code&gt; in Green, and then wonder why the cross-product views showed nothing. Centralizing the registration call removes a whole class of typo-driven mismatch. The LLM driving the tool can't accidentally rename the repo halfway through, because there's only one place the repo name is supplied.&lt;/p&gt;

&lt;h2&gt;
  
  
  Device Code Flow, because SSH
&lt;/h2&gt;

&lt;p&gt;The classic OAuth flow that Codens shipped originally went like this. You ran &lt;code&gt;purple_login&lt;/code&gt; from inside Claude Code. The CLI started a tiny HTTP server on a random local port, opened your browser, you signed in with Google, the browser redirected to &lt;code&gt;http://localhost:54321/callback&lt;/code&gt;, the CLI captured the auth code, exchanged it for a JWT, stored the JWT, done.&lt;/p&gt;

&lt;p&gt;That works beautifully on a laptop. It falls apart in roughly half the environments our users actually work in.&lt;/p&gt;

&lt;p&gt;If you're SSH'd into a dev box, there's no browser to open, and even if there were, the redirect to &lt;code&gt;http://localhost:54321/callback&lt;/code&gt; would hit the wrong machine. Same story for dev containers, Docker exec sessions, and GitHub Codespaces. You can sometimes paper over it with port forwarding, but you have to remember to set that up before running the login command, and most people don't. They just see a hung CLI and a broken Google sign-in page.&lt;/p&gt;

&lt;p&gt;The fix is RFC 8628, the OAuth 2.0 Device Authorization Grant. It's the same flow that lets you sign into your TV's Netflix app by typing a short code on your phone. The CLI never opens a browser locally. It posts to the auth server and receives a &lt;code&gt;device_code&lt;/code&gt;, a short &lt;code&gt;user_code&lt;/code&gt;, and a verification URL. It prints the URL and the user code. You open the URL on whatever device has a browser, type the code, approve. The CLI is meanwhile polling a token endpoint every few seconds. The moment you approve, the next poll succeeds and the CLI gets a JWT.&lt;/p&gt;

&lt;p&gt;Running it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;codens-mcp login
&lt;span class="go"&gt;Logging in to https://api.purple.codens.ai via https://api.auth.codens.ai ...

============================================================
  Device Authorization Required
============================================================

  1. Open this URL on any device:
     https://app.auth.codens.ai/device

  2. Enter this code when prompted:
     ABCD-1234

  Waiting for authorization (expires in 15 minutes)...
============================================================

  Authorization complete!

Logged in as you@example.com
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can sign in from your phone while SSH'd into a server. You can sign in from your laptop while pair-programming on a colleague's box. The CLI doesn't care where the browser is.&lt;/p&gt;

&lt;p&gt;The polling loop has one detail that's worth calling out. RFC 8628 says the auth server can return a &lt;code&gt;slow_down&lt;/code&gt; error to tell the client it's polling too aggressively. When that happens, the spec says the client must add at least 5 seconds to its polling interval. We honor that. It looks unimportant in code, but if you ignore it, a misbehaving client gets rate-limited and the user sees a login that just times out for no obvious reason. The spec is right; pay the 5 seconds.&lt;/p&gt;

&lt;p&gt;The other interesting bit: we share the credential file with &lt;code&gt;purple-codens-mcp&lt;/code&gt;. Both packages read and write &lt;code&gt;~/.purple-codens/credentials.json&lt;/code&gt;, mode 0600. One login authenticates all the Codens product backends, because Auth Codens is the SSO root and every product backend trusts JWTs it issues. If you have both packages installed, they coexist. If you have only &lt;code&gt;codens-mcp&lt;/code&gt;, the file is still at the same path, which means a user can install &lt;code&gt;purple-codens-mcp&lt;/code&gt; later and not need to log in again.&lt;/p&gt;

&lt;p&gt;Mode 0600 is the kind of thing that's easy to forget. Python's default &lt;code&gt;Path.write_text&lt;/code&gt; doesn't restrict permissions. We explicitly &lt;code&gt;chmod 0600&lt;/code&gt; after every write. If a credential file gets group-readable on a shared dev box, the JWT inside it is good for the token's full lifetime and there's no MFA prompt to slow an attacker down. The &lt;code&gt;chmod&lt;/code&gt; is one line; it should be one line in every credential-storing CLI.&lt;/p&gt;

&lt;p&gt;The polling timeout is set to 15 minutes, which is long enough to walk away from your terminal, find your phone, swear at the captcha, log in, and come back. We considered shorter values. The trade-off is that a 5-minute timeout looks more "responsive" but punishes the user who happened to be in a meeting when they ran the command. Fifteen minutes is what RFC 8628 suggests as a reasonable default and we didn't find a reason to argue with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subcommands, defaults, and not breaking anyone
&lt;/h2&gt;

&lt;p&gt;The CLI structure is &lt;code&gt;argparse&lt;/code&gt; with subparsers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codens-mcp &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;-h&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
codens-mcp login   &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--auth-url&lt;/span&gt; URL] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--api-url&lt;/span&gt; URL]
codens-mcp &lt;span class="nb"&gt;whoami&lt;/span&gt;  &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--api-url&lt;/span&gt; URL]
codens-mcp serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;serve&lt;/code&gt; starts the stdio MCP server. &lt;code&gt;login&lt;/code&gt; runs Device Code Flow. &lt;code&gt;whoami&lt;/code&gt; prints the email, user ID, organization, and remaining JPY credits for the currently authenticated user.&lt;/p&gt;

&lt;p&gt;The only mildly unusual choice is that &lt;code&gt;codens-mcp&lt;/code&gt; with no arguments runs &lt;code&gt;serve&lt;/code&gt;. That's deliberate. MCP servers in &lt;code&gt;.claude/settings.json&lt;/code&gt; are configured by command name. Existing entries written before the CLI got subcommands look like &lt;code&gt;"command": "codens-mcp", "args": []&lt;/code&gt;. If we'd made &lt;code&gt;serve&lt;/code&gt; mandatory, every existing config would break the moment users upgraded. So &lt;code&gt;parse_args()&lt;/code&gt; falls through to &lt;code&gt;_cmd_serve&lt;/code&gt; when &lt;code&gt;args.command&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt;. The cost is that running &lt;code&gt;codens-mcp&lt;/code&gt; interactively in a terminal blocks on stdio, which feels weird if you're not expecting it, but the tradeoff is that 0.4.0 is a no-config-change upgrade.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;_cmd_login&lt;/code&gt; function is twelve lines. All it does is call &lt;code&gt;purple_codens_mcp.auth.device_code_login&lt;/code&gt;, then hand the resulting tokens to &lt;code&gt;PurpleCodensClient.login_with_device_token&lt;/code&gt;, which writes them to disk and fetches the user's profile. We deliberately don't reimplement the OAuth flow here. If we did, we'd have two copies of an RFC 8628 polling loop, and one of them would inevitably drift. By delegating, the unified package gets new auth features for free whenever &lt;code&gt;purple-codens-mcp&lt;/code&gt; ships them.&lt;/p&gt;

&lt;p&gt;This is also why we kept &lt;code&gt;purple-codens-mcp&lt;/code&gt; published as a separate package. It's the dependency target. It has the auth module, the client class, the credential storage. &lt;code&gt;codens-mcp&lt;/code&gt; builds on top of it. Some users with legacy automation still install &lt;code&gt;purple-codens-mcp&lt;/code&gt; directly and we don't want to break them. The deprecation path, if there ever is one, would be to make &lt;code&gt;purple-codens-mcp&lt;/code&gt; a thin shim that re-exports from &lt;code&gt;codens-mcp&lt;/code&gt;. But that's a future problem and probably not worth solving until the userbase says it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  What setup looks like now
&lt;/h2&gt;

&lt;p&gt;Three lines of config, two commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"codens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"codens-mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;codens-mcp
codens-mcp login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. One install. One login. Thirty-one tools available inside Claude Code: Purple's project and credit management, Red's bug analysis and fix plans, Blue's E2E test generation and execution, Green's PRD consultations and kickoff creation, Auth's signup and pricing lookups, plus the cross-product registration tool. The same JWT works against every backend because Auth Codens is the issuer and every product validates against the same public key.&lt;/p&gt;

&lt;p&gt;For users on a laptop, the original browser-based &lt;code&gt;purple_login&lt;/code&gt; MCP tool still works exactly as before, and it's slightly faster than Device Code Flow because there's no polling delay. We kept it. The CLI's &lt;code&gt;codens-mcp login&lt;/code&gt; is for the headless cases. Both flows write to the same credential file in the same format.&lt;/p&gt;

&lt;p&gt;The release timeline was tight. 0.1.0 went out on May 6 with the unified package and the 31 tools. 0.3.0 followed the same day with &lt;code&gt;codens_register_project_unified&lt;/code&gt;. 0.4.0 shipped on May 7 with the CLI subcommands and the no-arg &lt;code&gt;serve&lt;/code&gt; default. The semver minors reflect API additions, not breakage; every version is install-and-go from the previous one.&lt;/p&gt;

&lt;p&gt;If I were starting over, I'd do the unified package first and skip the five-package phase entirely. But the five-package phase is how we figured out which tools each product actually needed, and you can't skip that. The cleanup is the easy part. The hard part is knowing what to keep.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pointers
&lt;/h2&gt;

&lt;p&gt;Package on PyPI: &lt;a href="https://pypi.org/project/codens-mcp/" rel="noopener noreferrer"&gt;pypi.org/project/codens-mcp&lt;/a&gt;. Help docs with the canonical agent reference, including all 31 tool signatures: &lt;a href="https://help.codens.ai/en/" rel="noopener noreferrer"&gt;help.codens.ai/en/&lt;/a&gt;. Codens itself: &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;codens.ai/en/&lt;/a&gt;. If you've been hand-rolling MCP server entries for a multi-product setup, the takeaway is: bundle the package, share the credential file, and add Device Code Flow before the first user complains about SSH. The work is small. The friction it removes is not.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>claude</category>
      <category>python</category>
      <category>oauth</category>
    </item>
    <item>
      <title>How we ended up running one product with 2-3 people, after building our own dev harness</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Wed, 06 May 2026 12:59:27 +0000</pubDate>
      <link>https://dev.to/zoetaka38/how-we-ended-up-running-one-product-with-2-3-people-after-building-our-own-dev-harness-k2e</link>
      <guid>https://dev.to/zoetaka38/how-we-ended-up-running-one-product-with-2-3-people-after-building-our-own-dev-harness-k2e</guid>
      <description>&lt;p&gt;The thing nobody told me about agentic coding speeding up by 5x is that your bottleneck just moves. Code lands faster, sure. But now the PM is the slow part of the loop. The QA pass is the slow part of the loop. The "wait, is this what we actually wanted" conversation that used to take a day now takes four days, because you can ship three iterations in the time it takes to confirm what one of them was supposed to do.&lt;/p&gt;

&lt;p&gt;I run a small development shop in Tokyo. We do contract work for a handful of companies, and every one of them wants more shipping velocity than they have headcount for. About eighteen months ago I started taking AI coding seriously as a way to scale our own capacity. The Claude API plus a decent agent runner could plausibly make us much faster. So I started using it on real work, and within a few weeks the experience had a shape I didn't expect: the agent was great. The chain leading up to and following the agent was the problem.&lt;/p&gt;

&lt;p&gt;We ended up building our own harness around the agents, and the thing it bought us wasn't faster code. It was tighter loops. The punchline is "one product, one PM, one engineer, end-to-end," but the surprises along the way are the actually useful part.&lt;/p&gt;

&lt;p&gt;I'm building Codens, the harness in this story — happy to talk about it but the goal here is the build journey, not a pitch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coding got fast, then everything around coding got slow
&lt;/h2&gt;

&lt;p&gt;The first month using Claude Code seriously on a client project, our velocity on the implementation step went up roughly 3x. It was real, measured against a previous quarter of similar tickets, and it stayed.&lt;/p&gt;

&lt;p&gt;What also went up was the rate at which we shipped the wrong thing. Not "broken" wrong. The agent didn't write code that crashed. It wrote code that worked, looked clean, and implemented something subtly different from what the client meant when they wrote the ticket. Sometimes we'd merge, deploy, and only notice in the next sync. Sometimes the client noticed immediately and we'd have a rework conversation that ate two days of the four days we'd just saved.&lt;/p&gt;

&lt;p&gt;The pattern was always the same: a one-paragraph ticket like "users should be able to undo a deletion within 30 seconds" gets handed to the agent, the agent writes a perfectly reasonable interpretation, and the perfectly reasonable interpretation isn't what the client had in their head. Humans had this problem too. The difference was that humans took a day to write the wrong thing, so we had a day to ask "wait, undo from where? the trash, or in-place? does it survive a page refresh? what about cascading deletes?" and catch the ambiguity before we'd implemented it.&lt;/p&gt;

&lt;p&gt;The agent took twenty minutes. The clarifying conversation now had to happen &lt;em&gt;after&lt;/em&gt; the implementation, against a concrete artifact, which feels efficient until you realize you're throwing away an hour of agent work for every ambiguous ticket.&lt;/p&gt;

&lt;p&gt;So the first thing I built wasn't anything that touched the agent. It was a thing that took the rough one-paragraph request and turned it into a structured spec the human PM could review &lt;em&gt;before&lt;/em&gt; anyone wrote code. Inputs, outputs, edge cases, what's explicitly out of scope. Half a page, every time. The lift dropped from "stare at a blank page for an hour" to "react to a draft for ten minutes."&lt;/p&gt;

&lt;p&gt;This was the first real win. Not faster code, clearer asks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Once specs were structured, the chain extended downstream on its own
&lt;/h2&gt;

&lt;p&gt;Here's the thing I didn't see coming. Once the spec output was structured (actual fields, actual edge-case lists, actual out-of-scope statements), handing that spec to the coding agent was &lt;em&gt;also&lt;/em&gt; better. Way better. The agent stopped guessing what the ticket meant because the ticket no longer contained guesses.&lt;/p&gt;

&lt;p&gt;So the chain extended naturally. Rough request, structured spec the PM reviews, agent implements, PR. The handoff between spec and implementation became an artifact instead of a conversation, which meant nobody had to be on Slack at the same time for it to flow.&lt;/p&gt;

&lt;p&gt;The first version of the implementation runner was embarrassingly thin. A Python script that took the spec, shelled out to Claude Code in a worker, and posted the resulting PR URL to Slack. I deployed it on a small box in our office on a Saturday afternoon. Within a week we'd run something like 40 tickets through it. Within two weeks we hit our first real problem: every client had different rules, and the agent didn't know any of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Each product had different rules, and the agent didn't know any of them
&lt;/h2&gt;

&lt;p&gt;This is where the harness started to look less like a script and more like a system.&lt;/p&gt;

&lt;p&gt;One of our clients ran a fintech with strict patterns around money handling. Every price had to flow through a specific pricing module, no direct multiplication of cents-amounts allowed. Another had a Next.js codebase with a strong convention that all data fetching happens in server components and &lt;code&gt;'use client'&lt;/code&gt; is a code-review red flag. A third had a Django monolith with a deeply opinionated repository pattern.&lt;/p&gt;

&lt;p&gt;The agent, by default, would happily write code that violated all of these. It wasn't being malicious. It was writing code that worked. Code that worked just often happened to violate house rules.&lt;/p&gt;

&lt;p&gt;We started keeping per-project context: a &lt;code&gt;CLAUDE.md&lt;/code&gt; per repo with the rules, the patterns, the "don't ever do X" list, the "always check Y before Z" notes. The agent reads it on every run. Not a novel idea; it's what most teams using Claude Code do. What was novel for us was treating it as a &lt;em&gt;deliverable&lt;/em&gt; of every project setup. New client onboards, first week is "let's write your rules file together." Half discovery, half negotiation, all of it useful.&lt;/p&gt;

&lt;p&gt;That helped. It did not solve the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rules-don't-stick problem, and the gates we had to add on top
&lt;/h2&gt;

&lt;p&gt;The agent would read the rules file. The agent would acknowledge the rules file in its planning output. The agent would then, several tool calls later, write code that violated rule #4 because it was working through the immediate sub-problem and rule #4 was no longer in the foreground of its context.&lt;/p&gt;

&lt;p&gt;The first time this hurt us in production was on the fintech client. A migration ran, a price calculation got refactored as a "tidying" side-effect, and the calculation now bypassed the pricing module, went straight to multiplying two integers, because the agent's local fix made the code "cleaner." The PR passed review (the reviewer was tired, the diff looked clean, the math looked right). It hit prod. Nobody noticed for two days because the answer was right within rounding. We caught it during a billing reconciliation.&lt;/p&gt;

&lt;p&gt;If the rule is "all prices flow through &lt;code&gt;bcp_price()&lt;/code&gt;," that rule cannot be advisory. The agent's judgment is the wrong layer for it. The reviewer's judgment is the wrong layer for it. There has to be a deterministic check that the agent's diff respects the rule, and that check has to run before the PR is mergeable, and it cannot be something the agent can talk its way past.&lt;/p&gt;

&lt;p&gt;So we added gates. Real ones. Not "AI checks AI"; that has the same failure mode as the agent forgetting rule #4. Deterministic checks: shell commands, regex over the diff, AST walks for the more sensitive ones. The agent runs them, gets the failure output back, fixes the diff, runs them again. If they don't pass, the run errors out. The agent cannot decide the gate "doesn't apply here."&lt;/p&gt;

&lt;p&gt;The split that emerged, and which I now think is the right shape for any agentic workflow handling production code, is: the agent does the open-ended judgment work (what to write, how to structure it), and a deterministic step machine sequences and checks that work. If verify fails, the agent's next turn gets the failure output as input and tries again. If verify still fails after a few retries, a human gets paged. Open-ended cognition where it earns its cost; deterministic plumbing everywhere else.&lt;/p&gt;

&lt;p&gt;This was the second non-obvious lever. The first was "AI writes specs." The second was "AI judgment cannot be trusted to enforce rules; rules are a separate layer."&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests went into the same chain, almost as an afterthought
&lt;/h2&gt;

&lt;p&gt;We were already running our test suites in CI. What changed once the gates existed was that the agent itself started running tests as part of its loop, before opening a PR.&lt;/p&gt;

&lt;p&gt;The shape was: agent finishes implementing, harness runs the project's verify command (lint, typecheck, relevant test slice). If it fails, the agent gets the last ~1500 bytes of test output piped into its prompt, edits, retries. Up to three iterations, then gives up and asks for help.&lt;/p&gt;

&lt;p&gt;The retry loop sounds fancy and was actually trivial to implement once the gate infrastructure was there. The output of "tests failed" is structured enough that the agent can read it and produce a corrective edit roughly 70% of the time on the first retry, climbing past 90% by the third. I didn't believe the number when I first measured it. The third retry is doing real work; it's not just throwing more tokens at a stuck problem, it's correcting the second retry's overcompensation.&lt;/p&gt;

&lt;p&gt;The thing that surprised me was the team consequence. Once tests were running inside the agent's loop, our human review time on PRs dropped sharply. Not because the AI was doing review (it wasn't, and I'd argue it shouldn't) but because the agent was no longer opening PRs that were going to fail CI. The class of review where you spent twenty minutes reading a diff only to comment "tests are failing, please fix" disappeared. PRs that arrived in front of a human were PRs where the agent had already passed the gates the human cared about most. The conversation moved up the stack, to architecture and product fit, where humans should have been spending the time anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production bugs needed to flow back into the same pipe
&lt;/h2&gt;

&lt;p&gt;Around the time the test loop stabilized, we hit the next obvious thing: production bugs are also tickets. They have a different shape (stacktrace and reproduction context, not a one-paragraph product request) but they're tickets. Why aren't they entering the same pipeline?&lt;/p&gt;

&lt;p&gt;So we built a bug ingestion path. Sentry events, error reports, customer feedback that mentions a specific broken thing come in through a webhook, get analyzed for "is this actionable enough to attempt a fix automatically," and if yes, the harness opens a fix PR. Same retry loop as the implementation path.&lt;/p&gt;

&lt;p&gt;There's a public-facing piece for one of our clients: a feedback page where users describe broken things in natural language, which gets turned into structured bug reports, which gets fed into the analyzer. End to end, a user reporting "the export button doesn't work on Safari" can result in a fix PR sitting in front of an engineer about ninety seconds later. Most of the time the engineer doesn't merge it as-is. They read it, adjust, rerun, merge. But they're starting from a reasonable diff against a reproduced bug, not from "let me see if I can repro this on Safari first."&lt;/p&gt;

&lt;p&gt;Bug reports, feature specs, and refactor tickets all hit the same harness, all run through the same gate sequence, all produce the same kind of artifact. The PM and engineer don't have to context-switch between "we're doing features now" and "we're triaging bugs now." The queue is a queue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rolling it out: the moment the team started using it without me
&lt;/h2&gt;

&lt;p&gt;For the first six months, the harness was something I personally ran. Tickets flowed in, I ran them, PRs came out. The other engineers used it occasionally but mostly worked the old way.&lt;/p&gt;

&lt;p&gt;The flip happened over about three weeks. One engineer started using it for the boring tickets, the ones that were "implement this CRUD endpoint exactly like the other three" type work. Their output on those weeks went up noticeably. Another engineer noticed and asked how to set it up for their project. By week three the harness was the default tool for incoming tickets across all four of our active clients.&lt;/p&gt;

&lt;p&gt;The thing that flipped it wasn't a feature. It was the moment the gates and the tests were reliable enough that an engineer could send a ticket through and trust the result enough to open the PR for review without re-doing the work themselves. Before that point, the harness was a curiosity. After that point, it was infrastructure.&lt;/p&gt;

&lt;p&gt;A small but important detail: we share an org-level credit pool for the LLM API calls across all the agents in the harness. No per-engineer budgeting. If you need to run a hard ticket through the implementation path eight times because the spec keeps shifting, you run it eight times. The shared pool means the cost conversation is a monthly business conversation, not a daily individual one. I think this matters more than people realize. Per-seat or per-run budgets create friction at exactly the wrong moment, the moment an engineer is deciding whether to use the tool or not.&lt;/p&gt;

&lt;p&gt;We also added a side-channel that observes the harness (every PR, every ticket, every gate failure, every retry) and computes activity signals per engineer per week. This wasn't supposed to be a productivity panopticon. It was "is the harness actually working" telemetry, the same way you'd watch a deploy pipeline. What it became, naturally, was a thing the engineers themselves used as a sanity check on their own work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where we landed: 2-3 people running a product end-to-end
&lt;/h2&gt;

&lt;p&gt;Today, three of our four active client products run with one PM and one engineer. The fourth has two engineers because it's a larger codebase mid-migration. The PM writes rough requests, the harness turns them into specs, the PM edits, the engineer reviews and approves the spec, the harness implements, gates run, tests run, PR lands, engineer reviews, merges. Bugs come in through the public feedback path, get analyzed, fix PRs land in front of the engineer.&lt;/p&gt;

&lt;p&gt;The engineer's day is mostly architecture decisions and prioritization conversations with the PM. The bulk of the implementation queue runs without their direct attention. They jump in when the gates fail in a way that needs human judgment, when a spec has a subtle ambiguity the PM didn't catch, when the bug analyzer surfaces a fix that's actually "this whole subsystem needs a rethink." Higher-leverage work, fewer bytes typed per day.&lt;/p&gt;

&lt;p&gt;The number that matters isn't "we 5x'd coding speed." At this point I don't even know how to measure that, because the question is malformed. The number that matters is iteration speed. From "client mentions a thing" to "thing is in production" used to be measured in days for us. It's now measured in hours for most of what we do, and the bottleneck is the conversation, not the work.&lt;/p&gt;

&lt;p&gt;That's the whole story. We didn't build a thing that codes faster. We built a thing that lets the conversation stay in front, where it should be, while the mechanical parts run. The mechanical parts include things you wouldn't have called mechanical five years ago: implementation, test fixing, bug triage. Most of those have a clear-enough shape that a deterministic step machine plus an agent in the open-ended slots gets you 80%+ of the way through, and the remaining 20% is exactly the work the humans were good at to begin with.&lt;/p&gt;

&lt;p&gt;If I had to compress the lesson, it's this. The first AI win is faster typing. Real teams quickly find that's not the bottleneck. The actual win is moving the slow conversation, the "what are we even building" conversation, back to the front of the loop, and giving the rest of the loop enough deterministic structure that humans only show up for the parts that need them. You don't get there by buying an AI coding tool. You get there by treating the whole loop as the design surface.&lt;/p&gt;

&lt;p&gt;Codens is the harness we built for ourselves and now offer to other teams (&lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;https://www.codens.ai/en/&lt;/a&gt;). Other shops will solve this differently. But if you're a small team feeling like the AI made you faster at typing without making you faster at shipping, the gap is probably somewhere I described above. Worth looking at the parts of the loop that aren't typing.&lt;/p&gt;

&lt;p&gt;Closing thought. The iteration speed thing keeps surprising me even now. We've been operating at the new tempo for about six months. Every time a client says "can you have this by Friday" and we ship Wednesday, I notice a small flicker of "is that actually going to be okay long-term." Not because of the code but because of what the new tempo does to the conversation rhythm. Our PMs have had to learn to think faster about what they want, because the implementation lag is no longer a built-in pause for reflection. That's a real cost. I don't think it outweighs the benefit, but it's the part of this transition I most underestimated. The harness made the mechanics fast. It also pushed all the unfastness onto the humans who decide what to build, which is, on net, where I want it. But it doesn't feel free.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>devops</category>
      <category>saas</category>
    </item>
    <item>
      <title>Codens vs Devin vs Cursor Composer vs Sweep — picking the AI coding agent that matches your bottleneck</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Wed, 06 May 2026 09:19:16 +0000</pubDate>
      <link>https://dev.to/zoetaka38/codens-vs-devin-vs-cursor-composer-vs-sweep-picking-the-ai-coding-agent-that-matches-your-2hoe</link>
      <guid>https://dev.to/zoetaka38/codens-vs-devin-vs-cursor-composer-vs-sweep-picking-the-ai-coding-agent-that-matches-your-2hoe</guid>
      <description>&lt;p&gt;I get asked "how is Codens different from Devin / Cursor / Sweep" enough that I want to write the honest version once. The short answer is that these four tools live on different parts of the development lifecycle, and most of the people asking the question are actually trying to figure out which bottleneck they have — not which product is "best." So this is the comparison I'd want if I were on the buying side.&lt;/p&gt;

&lt;p&gt;Quick disclaimer up top: I'm building Codens, the harness in this comparison — happy to talk about it but the goal here is genuine comparison, not pitch. Where the others are stronger I'll say so, and where Codens has weaknesses (smaller customer base, opinionated workflow, JP-first market) I'll call them out in the same paragraph as the strengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  What each one is solving for
&lt;/h2&gt;

&lt;p&gt;Before any feature comparison, the more useful framing is: what problem was each of these tools designed around? Because the design decisions that follow — pricing, surface area, where the user clicks first — all flow from that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Devin (Cognition)&lt;/strong&gt; is solving "I want to assign a software engineering ticket and have it done without me sitting at the keyboard." It's an autonomous agent with its own dev environment, runs in the cloud, and the interaction model is closer to "delegating to a remote teammate" than to using a tool. You give it a task, it works for hours, you come back to a PR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor Composer (Anysphere)&lt;/strong&gt; is solving "I am at the keyboard right now and I want the inner loop of writing code to be faster." Cursor is an AI-first IDE forked from VS Code, and Composer is its multi-file edit agent that lives inside the editor. The whole UX assumes a developer is present, watching the diffs as they land, accepting and rejecting suggestions in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sweep (sweep.dev)&lt;/strong&gt; is solving "I have GitHub issues that should be small PRs and I don't want a human to do them." The trigger is filing or labeling an issue; the output is a PR; the whole flow is GitHub-native and async.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codens (us)&lt;/strong&gt; is solving "the SDLC has at least five distinct bottlenecks and I want a harness that handles all of them with specialized agents that share state and a budget." The entry point isn't an IDE or a GitHub issue — it's a Notion ticket written by anyone in the company, including non-engineers. From there an orchestrator agent routes the work, and other specialized agents (PRD writer, error auto-fix, test gen, activity ledger) cover the rest of the loop.&lt;/p&gt;

&lt;p&gt;If you read those four sentences carefully you'll notice they don't really overlap. They share buzzwords — "AI agent," "writes code," "opens PRs" — but the problems are genuinely different. That's why "which one wins" isn't the right question.&lt;/p&gt;

&lt;h2&gt;
  
  
  At-a-glance comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Devin&lt;/th&gt;
&lt;th&gt;Cursor Composer&lt;/th&gt;
&lt;th&gt;Sweep&lt;/th&gt;
&lt;th&gt;Codens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary entry point&lt;/td&gt;
&lt;td&gt;Web UI / Slack ticket&lt;/td&gt;
&lt;td&gt;IDE (Cursor editor)&lt;/td&gt;
&lt;td&gt;GitHub issue&lt;/td&gt;
&lt;td&gt;Notion ticket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where it runs&lt;/td&gt;
&lt;td&gt;Cloud (own dev env)&lt;/td&gt;
&lt;td&gt;Local (your IDE)&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;Cloud worker per agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single-agent vs multi-agent&lt;/td&gt;
&lt;td&gt;Single autonomous agent&lt;/td&gt;
&lt;td&gt;Single in-editor agent&lt;/td&gt;
&lt;td&gt;Single agent&lt;/td&gt;
&lt;td&gt;5 specialized agents + orchestrator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineer required to operate&lt;/td&gt;
&lt;td&gt;No (designed for delegation)&lt;/td&gt;
&lt;td&gt;Yes (it's an IDE)&lt;/td&gt;
&lt;td&gt;No (issue-driven)&lt;/td&gt;
&lt;td&gt;No (Notion ticket entry)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-engineer entry&lt;/td&gt;
&lt;td&gt;Yes, via ticket&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Limited (must file issue)&lt;/td&gt;
&lt;td&gt;Yes, designed for it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Billing model&lt;/td&gt;
&lt;td&gt;Subscription + usage / "ACUs"&lt;/td&gt;
&lt;td&gt;Per-seat subscription&lt;/td&gt;
&lt;td&gt;Per-PR / subscription tiers&lt;/td&gt;
&lt;td&gt;Org-wide credit pool shared across agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runs offline / on your hardware&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Editor local, AI calls go out&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No (editor is closed)&lt;/td&gt;
&lt;td&gt;Has open-source heritage&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best at&lt;/td&gt;
&lt;td&gt;End-to-end ticket completion&lt;/td&gt;
&lt;td&gt;Fast in-editor coding&lt;/td&gt;
&lt;td&gt;Issue-to-PR async&lt;/td&gt;
&lt;td&gt;Covering the SDLC end-to-end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Honest weakness&lt;/td&gt;
&lt;td&gt;Opaque mid-task; expensive&lt;/td&gt;
&lt;td&gt;Engineer-bound; no async&lt;/td&gt;
&lt;td&gt;GitHub-only; narrow scope&lt;/td&gt;
&lt;td&gt;Smaller customer base; opinionated; JP-first&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A note on this table: every row is verifiable from each product's public marketing or docs at the time of writing. I've tried not to infer beyond what they say themselves. The "honest weakness" column is my read, not theirs — but it's the trade-off I'd want flagged if I were choosing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases — which one to actually pick
&lt;/h2&gt;

&lt;p&gt;This is where I think the comparison gets useful. Most engineering orgs I talk to have one of four bottlenecks, and each of these tools maps cleanly to one of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Devin if your bottleneck is "I have well-scoped tickets and not enough engineers."&lt;/strong&gt; Devin's design is genuinely good for the case where you have a backlog of issues that are each maybe a half-day to a full day of work for a mid-level engineer, and you'd like them done while you sleep. The trade-off you accept is opacity — Devin works for hours, and you don't really get to peek inside until it's done. If the task ends up being underspecified or off-track you find out at the end. That's fine for some workflows and brutal for others.&lt;/p&gt;

&lt;p&gt;The other Devin trade-off is cost. The pricing is structured around usage units and the rate per unit assumes you value engineer-equivalent hours, so it isn't a "let's see how it does on a tiny project" kind of purchase. If you have the volume, it's a real lever. If you don't, the per-task math gets uncomfortable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Cursor if your bottleneck is "my engineers are spending too much time on the keyboard, not on thinking."&lt;/strong&gt; Cursor Composer is excellent at the inner loop — you're refactoring across five files, you describe the change, it does the multi-file edit, you review the diff in your editor and accept it. The feedback cycle is tight and you stay in flow. This is the right tool when the human is in the loop and the goal is to make that human faster, not to remove them.&lt;/p&gt;

&lt;p&gt;The trade-off Cursor makes is exactly the inverse of Devin's: it's engineer-bound. There is no "delegate this and come back tomorrow" mode. There's also no shared organizational budget — billing is per-seat, which is great for a team of eight engineers and starts to feel weird if you want non-engineers to occasionally trigger work too. (They mostly can't, because they're not in the IDE.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Sweep if your bottleneck is "we have a long tail of small GitHub issues that nobody wants to do."&lt;/strong&gt; This is a real bottleneck for a lot of OSS projects and for internal repos with a lot of papercut tickets. Sweep's GitHub-native trigger is genuinely the right design here — file the issue, label it, and the agent picks it up. The async nature means it doesn't block anyone.&lt;/p&gt;

&lt;p&gt;The honest Sweep trade-off is scope. It's a single-agent system targeting one specific moment in the lifecycle (issue → PR). If your bottleneck isn't issue-to-PR specifically — if it's PRD writing, or error response, or test coverage — Sweep doesn't address it, and adding more single-agent tools to cover those gaps gets you a fragmented stack with three different billing dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Codens if your bottleneck is "the whole SDLC is leaky and we want one harness to cover the leaks."&lt;/strong&gt; Codens is built around the idea that there isn't a single bottleneck — there's PRD quality, there's response time on production errors, there's test coverage that decays, there's the "what did engineering even ship this week" reporting question. Each is its own agent, but they share a credit pool and they share organizational state, so an error caught by the auto-fix agent can become a ticket the PRD agent enriches and the orchestrator schedules.&lt;/p&gt;

&lt;p&gt;The honest Codens trade-offs: we're newer (fewer customer references), we're opinionated (the workflow assumes Notion as the entry point and won't bend much on that), and we're JP-first (most of the existing customers are Japanese, the docs landed in Japanese first, and the EN side is catching up). If those bother you, one of the other three is probably a better fit. If you want a single harness that covers more than one stage of the loop, the per-stage alternatives don't really exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I chose a multi-agent harness for Codens
&lt;/h2&gt;

&lt;p&gt;This is positioning, not pitch — but it's worth explaining because it's the design decision that makes Codens look most different from the other three.&lt;/p&gt;

&lt;p&gt;When I started Codens, the obvious option was to build "Devin but cheaper" or "Sweep but multi-language" or "Cursor but in the cloud." Each is a defensible product. I went a different direction because of one observation: the people I was building for weren't lacking a coding agent. They had Cursor, or Copilot, or Claude in the IDE. What they were lacking was a workflow that connected the bits.&lt;/p&gt;

&lt;p&gt;Specifically: a PRD got written in a Google Doc, copy-pasted into a Notion ticket, partially translated into a GitHub issue, picked up by an engineer using an in-IDE AI tool, shipped, broke in production, the error landed in Sentry, somebody manually opened a ticket about it, and three weeks later the team retro asked "wait, what did we even ship this quarter?" Each link in that chain had a tool. But the chain itself had no harness.&lt;/p&gt;

&lt;p&gt;So the bet I made is that the value isn't in any single agent being the best — Cursor will probably always be a better in-IDE editor than anything I build, Devin will probably always be a better autonomous engineer for a single hard ticket — the value is in the harness that owns the chain. Five specialized agents, each narrower in scope than a generalist autonomous agent, but coordinated, sharing state, sharing a credit pool.&lt;/p&gt;

&lt;p&gt;The trade-off of this choice, which I want to be honest about: a harness is a heavier sell than a tool. "Try this in your editor" is a five-minute decision. "Adopt this as the way your org routes work" is a six-week one. I knew that going in, and I think it's the right bet for this product, but I'm not going to pretend the GTM is as easy as Cursor's.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing models compared
&lt;/h2&gt;

&lt;p&gt;Pricing is the part where the design philosophies show up most clearly, and it matters because the wrong pricing model for your usage shape can be a 3-5x difference in real cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Devin&lt;/strong&gt; charges by usage units (often called ACUs in their pricing) on top of a subscription floor. The mental model is "pay for engineering hours equivalent." This is honest pricing — a Devin task is genuinely doing work that would have been an engineer-hour — but it's only economical if you have task volume that justifies the rate. If you'd be using it five times a month, the per-task math doesn't favor you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor&lt;/strong&gt; is per-seat. This is the IDE-tool playbook and it's the cleanest pricing of the four. Every engineer who codes pays a flat fee. If your team is twelve engineers, you pay for twelve seats. The downside is you're not paying for outcomes, you're paying for access — so if half your engineers barely use Composer, you're still paying for them, and if non-engineers want to trigger AI work occasionally, there's no clean model for that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sweep&lt;/strong&gt; has tier-based subscriptions plus per-PR concepts. The async issue-to-PR shape lends itself to "you pay roughly per output," which is a clean unit economically. The tiers add some predictability on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codens&lt;/strong&gt; uses an org-wide credit pool. One number for the whole organization, drawn down by every agent — error auto-fix, PRD agent, test gen, all of them. The intent is that you're not separately budgeting "how many auto-fixes per month" and "how many test generations per month"; you're budgeting "how much AI work this org does this month," and the agents share that pool.&lt;/p&gt;

&lt;p&gt;The honest trade-off of the pool model: it's harder to predict in month one because you don't yet know which agents you'll lean on. By month three the shape settles and the pool is the cleanest model for a multi-agent setup, but the first 30 days require some attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the actual choice you're making
&lt;/h2&gt;

&lt;p&gt;If you've read this far you probably already see the shape, but let me make it explicit because it's how I'd reason about it if I were on the buying side.&lt;/p&gt;

&lt;p&gt;The choice isn't "which AI agent is best." The choice is &lt;strong&gt;which axis of automation matters most for you right now&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the axis is &lt;strong&gt;engineer throughput on well-scoped tickets&lt;/strong&gt;, you want autonomous task completion. That's Devin.&lt;/li&gt;
&lt;li&gt;If the axis is &lt;strong&gt;engineer speed at the keyboard&lt;/strong&gt;, you want in-IDE multi-file editing. That's Cursor Composer.&lt;/li&gt;
&lt;li&gt;If the axis is &lt;strong&gt;closing out small GitHub issues without human intervention&lt;/strong&gt;, you want issue-to-PR async. That's Sweep.&lt;/li&gt;
&lt;li&gt;If the axis is &lt;strong&gt;the whole loop from PRD to error response&lt;/strong&gt;, you want a multi-agent harness. That's where Codens fits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also stack these. Cursor in the IDE plus Sweep on the GitHub side plus Codens for the multi-stage harness is a real combination — they don't conflict because they're operating on different surfaces. The combination most people don't do is "Devin plus Codens," because both of those want to own the work-routing layer, and you'd be paying twice for that.&lt;/p&gt;

&lt;p&gt;The combination that I'd push back on: stacking three single-agent tools to try to recreate a harness. It's tempting because each individual tool is cheap to try. But you end up with three billing dashboards, three logging surfaces, three separately-budgeted credit pools that can't share, and no shared state between the agents. That's the fragmentation problem the harness model is meant to fix, and you can't really stack your way out of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing — peer note
&lt;/h2&gt;

&lt;p&gt;If you're evaluating any of these, my honest peer advice is this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify your actual bottleneck before the demo. "AI coding agents" is a category; the four products in this article solve four different problems. Showing up to a demo without knowing which problem you have is how you end up with the most expensive tool that doesn't fit.&lt;/li&gt;
&lt;li&gt;Run a two-week pilot, not a 30-minute demo. All four of these tools demo well. The thing that matters is what the integration looks like in your team's actual workflow with your actual ticket shape, and that takes longer than 30 minutes to surface.&lt;/li&gt;
&lt;li&gt;Take the per-seat vs per-task vs credit-pool pricing math seriously. The wrong model isn't "more expensive than I thought" — it's a misalignment with how your team actually generates work, and it'll show up as friction every month.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'm biased because I built Codens — I've said that. But the fairest version of this comparison is the one where Cursor wins on inner-loop coding, Devin wins on autonomous task completion, Sweep wins on issue-to-PR, and Codens wins on covering the whole loop. Those are different wins. Pick the one you actually need.&lt;/p&gt;

&lt;p&gt;If a multi-agent harness is the shape that fits your bottleneck and you want to dig in, the EN landing page is at &lt;a href="https://www.codens.ai/en/" rel="noopener noreferrer"&gt;codens.ai/en&lt;/a&gt;. If one of the other three is the better fit, I'd genuinely rather you use that one — a misfit harness is worse than a well-fit point tool.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>claude</category>
      <category>productivity</category>
    </item>
    <item>
      <title>We shipped a pricing page that was 50,000x off from the meter — a SaaS pricing post-mortem</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Wed, 06 May 2026 09:16:43 +0000</pubDate>
      <link>https://dev.to/zoetaka38/we-shipped-a-pricing-page-that-was-50000x-off-from-the-meter-a-saas-pricing-post-mortem-2ade</link>
      <guid>https://dev.to/zoetaka38/we-shipped-a-pricing-page-that-was-50000x-off-from-the-meter-a-saas-pricing-post-mortem-2ade</guid>
      <description>&lt;p&gt;I noticed it mid-conversation. A potential customer was reading our pricing page out loud — "Pro plan, ¥50,000 a month, 100 credits" — and asked, "so each credit is about ¥500?" I opened the backend. Our &lt;code&gt;RateCard&lt;/code&gt; row had &lt;code&gt;input_token_rate: 1.5&lt;/code&gt;, &lt;code&gt;output_token_rate: 7.5&lt;/code&gt;, and the implicit unit was &lt;em&gt;one credit equals ¥0.01&lt;/em&gt;. A typical repair task burns 80,000–125,000 credits. On the backend, ¥50,000 buys you five million credits. On the LP, ¥50,000 was being read as five hundred yen per credit.&lt;/p&gt;

&lt;p&gt;The gap was 50,000x. Same word — "credit" — pointing at two different products.&lt;/p&gt;

&lt;p&gt;I'm building an AI dev harness called Codens; the relevant context here is that the billing meter is real (PostgreSQL row-level enforcement, a &lt;code&gt;RateCard&lt;/code&gt; table, idempotent &lt;code&gt;consume_credit&lt;/code&gt; calls from worker pods), and the marketing copy was written months earlier when "credit" meant something else. The implementation moved. The copy didn't. Nobody re-checked.&lt;/p&gt;

&lt;p&gt;This is the post-mortem. What the SSOT actually said, what the LP was actually saying, how the two diverged silently, and what we shipped to re-align them. I'm writing it because I suspect this exact failure mode is more common than people admit — pricing copy that drifts from the meter is the kind of bug that doesn't show up in CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the SSOT lived (and what it said)
&lt;/h2&gt;

&lt;p&gt;The single source of truth for our pricing is two things, both in the Billing Control Plane (BCP) database:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;billing_rate_cards&lt;/code&gt; table — token-to-credit conversion rates, model multipliers, minimum charge.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;plans&lt;/code&gt; table — tier definitions: monthly credit quotas and their JPY prices.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rate card is enforced at consume time. Every &lt;code&gt;POST /credits/deduct&lt;/code&gt; call goes through &lt;code&gt;ConsumeCreditUseCase._price&lt;/code&gt;, which does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ConsumeCreditRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;card&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rate_card_repo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_current&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;card&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;NotFound&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no active rate card&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_model_multiplier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;in_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_input_token_rate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;multiplier&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;quantize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_output_token_rate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;multiplier&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;quantize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in_cost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;out_cost&lt;/span&gt;
    &lt;span class="n"&gt;minimum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minimum_charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;minimum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;minimum&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;quantize(Decimal("0.01"))&lt;/code&gt;. That's not a coincidence. Credits are denominated to two decimal places; a credit, in this code, is a one-yen unit at one cent precision. The seed migration is explicit about it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;INITIAL_RATES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_token_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_token_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;by_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1 input token costs 1.5 credits (multiplied by model factor). 1 output token costs 7.5 credits. A typical Sonnet repair task — 50K input tokens, 20K output tokens — comes to &lt;code&gt;50000 * 1.5 + 20000 * 7.5 = 225,000&lt;/code&gt; credits. With internal corrections and the 3-Retry loop, real-world averages land at 80K–125K credits. At ¥0.01 per credit, that's ¥800–¥1,250 per task in raw token cost.&lt;/p&gt;

&lt;p&gt;That's the meter. It was correct. It was deployed. It was charging real customers correctly.&lt;/p&gt;

&lt;p&gt;Then the &lt;code&gt;plans&lt;/code&gt; table sits on top:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PLANS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;free&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Free&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hobby&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hobby&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;business&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Business&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enterprise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enterprise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Format: &lt;code&gt;(tier, display_name, credit_quota, price_jpy, seat_limit)&lt;/code&gt;. So Pro is 1,000,000 credits/month for ¥10,000. Business is 5,000,000 credits for ¥50,000. The implied price is ¥0.01 per credit, exactly matching the rate card's quantize precision. The two tables are internally consistent. The DB knows what it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  What marketing was saying
&lt;/h2&gt;

&lt;p&gt;Now here's what the LP said before the alignment commit. I'll quote the actual diff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- BEFORE --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"text-4xl font-bold text-gray-900 mb-1"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;$333&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"text-gray-500 text-sm"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;/month (excl. tax)&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;&amp;lt;!-- ... --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;span&amp;gt;&amp;lt;strong&amp;gt;&lt;/span&gt;100 credits/month&lt;span class="nt"&gt;&amp;lt;/strong&amp;gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pro plan: $333/month, 100 credits/month. There was no Hobby tier. There was no mention of token-level metering. The "What is a Credit?" section said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;1 credit ≈ 1 AI repair or generation task. One credit is the unit consumed when AI completes one task. Depending on task complexity, 0.5–3 credits may be consumed per operation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A reader doing the obvious math gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$333 / 100 credits = $3.33 per credit&lt;/li&gt;
&lt;li&gt;At ¥150/USD: roughly ¥500 per credit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then they go on to use the product. The first repair task fires. The meter charges them ~100,000 credits. They have ~999,900 credits left of an "100 credits/month" plan. The numbers on their dashboard make no sense relative to the LP. Either we're charging them 1000x what we said, or "credit" doesn't mean what the LP said it meant.&lt;/p&gt;

&lt;p&gt;The latter, of course. But you can't tell which from the LP. Same word, two products.&lt;/p&gt;

&lt;p&gt;The Business tier was worse. LP said $1,000/month for 500 credits. Implied unit price: $2/credit. Backend: 5,000,000 credits for ¥50,000 ≈ $333. So the LP wasn't just unit-confused — it was 3x more expensive than the meter, &lt;em&gt;and&lt;/em&gt; implying a unit price 50,000x the actual one. Two errors stacked.&lt;/p&gt;

&lt;p&gt;The exact arithmetic in the commit message is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Old LP: Pro $333 = 100 credits (implied $3.33/credit ≈ ¥500/credit)&lt;br&gt;
Backend SSOT: ¥0.01/credit (DB credit_packages: ¥1,000 = 100,000 credits)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;¥500 vs ¥0.01. That's the 50,000x.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the gap formed
&lt;/h2&gt;

&lt;p&gt;Reconstructing the timeline, here's the most charitable version of what happened. Earlier in the project, "credit" did mean "one AI task." That was the original product copy: "you get N tasks per month, each task = 1 credit." It was a clean abstraction for a landing page. It was also wrong as soon as the implementation moved to token-level metering, which it did when we needed actual cost-tracking accuracy. A "task" varies wildly — a small bug fix is 30K tokens, a multi-file refactor is 200K. Charging a flat 1 credit for both was either over-charging the small ones or eating the large ones.&lt;/p&gt;

&lt;p&gt;So the meter migrated to per-token. The rate card got the 1.5/7.5/multiplier shape. The &lt;code&gt;consume_credit&lt;/code&gt; use case started quantizing to ¥0.01. Internal numbers all aligned around the new unit.&lt;/p&gt;

&lt;p&gt;The LP was never updated.&lt;/p&gt;

&lt;p&gt;There's no single moment where the divergence "happened." It happened in the absence of a moment — the absence of a re-check pass triggered by the implementation change. The feature flag was "we are now metering by token." The follow-up work was "verify the meter, fix any consumer code that depended on the old unit, ship." Marketing copy lives in a separate repo (&lt;code&gt;www-codens&lt;/code&gt;), is owned by a separate part of my brain, and isn't a "consumer of the new unit" in any tracked sense. Nothing in CI catches "the LP says one thing, the database says another." The two systems never have to agree because they never read each other.&lt;/p&gt;

&lt;p&gt;The general shape: &lt;strong&gt;pricing copy that doesn't gate on the engineering SSOT will drift the moment the SSOT moves.&lt;/strong&gt; This is the same class of bug as comments diverging from code, except the consequences are a customer trust failure rather than a code-review nuisance. A customer who computes ¥500/credit and then sees credits drain by the hundred thousand has been lied to, regardless of intent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detecting the drift
&lt;/h2&gt;

&lt;p&gt;The detection event was unglamorous. I was on a call with a potential customer. They asked the unit-price question. I said, "let me pull up the actual numbers," opened the BCP DB and the LP side by side, and stared.&lt;/p&gt;

&lt;p&gt;A few things made the gap diagnosable in seconds once both were on screen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The LP showed &lt;code&gt;100 credits/month&lt;/code&gt;. The DB showed &lt;code&gt;1_000_000&lt;/code&gt;. Three orders of magnitude off, just on the quota number.&lt;/li&gt;
&lt;li&gt;The LP showed &lt;code&gt;$333&lt;/code&gt; for Pro. The DB showed &lt;code&gt;¥10,000&lt;/code&gt; (~$67). Five times off, in the other direction.&lt;/li&gt;
&lt;li&gt;The "What is a Credit?" copy said "1 credit ≈ 1 task." The rate card said &lt;code&gt;input_token_rate: 1.5&lt;/code&gt;. The unit on the LP didn't even share dimensions with the unit in the meter.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The compound effect — wrong quota, wrong price, wrong unit definition — is what produced 50,000x. Each individual error was a typical 5–1000x drift; stacked, they multiplied.&lt;/p&gt;

&lt;p&gt;The thing I want to call out: there was no monitoring that would have surfaced this. The meter was working correctly. Customer balances were tracking correctly against actual usage. CI was green. The bug was entirely between two documents that had no mechanical relationship — the rendered HTML and the seed migration. Either could be edited without touching the other and nothing would alert.&lt;/p&gt;

&lt;p&gt;The git commit message I ended up writing pulls no punches:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;fix(lp): align pricing page with backend SSOT (5 tiers, 1 credit = ¥0.01)&lt;/p&gt;

&lt;p&gt;Marketing pricing was 5万倍ズレ vs implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old LP: Pro $333 = 100 credits (implied $3.33/credit ≈ ¥500/credit)&lt;/li&gt;
&lt;li&gt;Backend SSOT: ¥0.01/credit (DB credit_packages: ¥1,000 = 100,000 credits)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;"5万倍ズレ" is "50,000x off." I left it in Japanese in the commit because the visceral version of "we shipped pricing this wrong" reads better in the language I was thinking in when I caught it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Re-alignment: what we shipped
&lt;/h2&gt;

&lt;p&gt;The alignment commit (&lt;code&gt;d082997&lt;/code&gt;) did three things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One: restructure tiers to match the seed table.&lt;/strong&gt; The LP went from a 4-tier structure (Free Trial / Pro / Business / Enterprise) to a 5-tier structure (Free Trial / Hobby / Pro / Business / Enterprise) matching the &lt;code&gt;plans&lt;/code&gt; migration. Quotas and prices were copied directly from the seed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- AFTER --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"text-4xl font-bold text-gray-900 mb-1"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;$67&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"text-gray-500 text-sm"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;/month (excl. tax) · ¥10,000&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;&amp;lt;!-- ... --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;span&amp;gt;&amp;lt;strong&amp;gt;&lt;/span&gt;1,000,000 credits/month&lt;span class="nt"&gt;&amp;lt;/strong&amp;gt;&lt;/span&gt; (~12-13 repair tasks)&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;(~12-13 repair tasks)&lt;/code&gt; annotation. That's the bridge between the meter unit (credits) and the human unit (tasks). The LP isn't pretending one credit equals one task anymore; it's giving you both numbers so you can do your own arithmetic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two: rewrite the "What is a Credit?" section to be honest about per-token pricing.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;h3&amp;gt;&lt;/span&gt;1 credit = ¥0.01 (~$0.000067)&lt;span class="nt"&gt;&amp;lt;/h3&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;Pricing scales with actual token usage: 1 input token = 1.5 credits,
   1 output token = 7.5 credits, with model multiplier (Haiku 0.2x /
   Sonnet 1.0x / Opus 5.0x). Task consumption varies by complexity
   and retry count.&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the section I was most reluctant to rewrite. It surrenders the clean "1 credit ≈ 1 task" abstraction in exchange for technical accuracy. The trade is correct — accuracy beats simplicity when the simplicity is a lie — but it does make the LP harder to skim. The mitigation is the per-product example cards underneath: "Error Auto-Fix ≈ 80K-125K credits per task (¥800-1,250)." The reader gets the technical truth and the human-scale example side by side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three: fix the trial bonus to match a usable workload.&lt;/strong&gt; The old trial was "2 credits/day (28 total)." Under the new (correct) unit, 28 credits is enough to consume about 0.0001% of one task. It would have been a defective trial.&lt;/p&gt;

&lt;p&gt;The fixed trial: &lt;strong&gt;30,000 credits over 14 days&lt;/strong&gt;. That's roughly 0.4 of one repair task — enough to start a task, see the orchestration, and (in many cases) finish a small one. Still not generous, but in the right order of magnitude. The "free trial = enough to do one real thing" rule was something I should have written down as a constraint earlier; without it, the old "28 total credits" had no anchor in real usage.&lt;/p&gt;

&lt;p&gt;The top-up packages got the same alignment treatment, with prices matching the DB seed exactly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;¥1,000 → 100,000 / ¥5,000 → 525,000 (+5%) / ¥10,000 → 1,100,000 (+10%) / ¥50,000 → 5,750,000 (+15%)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The bonuses (5/10/15%) are a separate decision; what matters here is that the four package sizes on the LP correspond byte-for-byte to four rows in the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons for SaaS pricing as engineering
&lt;/h2&gt;

&lt;p&gt;A few things I'm taking from this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing copy is downstream of the SSOT, not an upstream input.&lt;/strong&gt; The bug existed because we treated the LP as a parallel artifact — written once, edited occasionally, sourced from "what we want to charge" rather than "what the meter actually does." Once a meter exists, the LP is a &lt;em&gt;projection&lt;/em&gt; of the meter's state. If the meter's tiers are in a &lt;code&gt;plans&lt;/code&gt; migration, the LP's tier cards have to be generated from (or at minimum cross-checked against) that migration. We're not yet generating the LP from the seed — that's the next step — but the alignment commit does at least make the values copy-pasted rather than independently invented.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"1 unit of consumption" framings need a meter that proves the unit.&lt;/strong&gt; The "1 credit = 1 task" framing was honest at the moment we wrote it (when the meter was per-task) and dishonest the moment the meter moved (per-token). The general principle: any pricing-copy phrase of the form "X = Y" should be falsifiable against a query against the production database. If the LP says "100 credits per month" and the database disagrees, the database is correct and the LP is a bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free trial size should match one full task, not "N credits/day."&lt;/strong&gt; "2 credits/day" was a marketing-friendly framing that didn't survive contact with the meter. A trial isn't a daily allowance — it's a budget for the user to do &lt;em&gt;one real thing&lt;/em&gt; and see the product work. The 30,000-credit trial gets close to that in our case (0.4 tasks; not quite); the "100,000 free for the trial period" framing other people use is more honest because the unit is the natural unit of consumption. If your trial isn't large enough to complete one canonical end-to-end use, it isn't a trial, it's a teaser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you find a 50,000x gap, the right move is to align AND drop prices.&lt;/strong&gt; This is the part I almost skipped. The natural alignment was "raise the LP's credit numbers to match the SSOT" — keep the dollar prices, just stop saying "100 credits" when the meter gives you a million. But that would have meant a Pro plan at $333/month for 1,000,000 credits, which (at the actual cost-per-task) is over-priced relative to what we'd want to charge a developer dogfooding the tool. The alignment commit dropped the LP prices simultaneously: Pro $333 → $67, Business $1,000 → $333. The arithmetic justifying the drop is "we now know what it actually costs us per task; let's price the tier at a margin we'd agree to publish." Aligning copy without re-pricing would have produced a technically-correct LP that was strategically wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the cross-check into CI.&lt;/strong&gt; This is what I haven't done yet, and it's the thing that worries me most. There's no test that fails when the LP and the seed migration disagree. The fix is straightforward — parse the rendered HTML, extract the &lt;code&gt;data-tier&lt;/code&gt; and &lt;code&gt;data-credits&lt;/code&gt; attributes (which we'd need to add), assert they match a query against the &lt;code&gt;plans&lt;/code&gt; table — but it's the kind of work that doesn't ship until it bites you a second time. I'd rather not have a second time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If you've shipped a SaaS product where the meter and the marketing copy live in different repos, I'm curious how you've kept them in sync — or whether you have. The failure mode I described doesn't show up in any CI signal I've seen; it shows up in customer conversations, weeks or months after the divergence happens. Has anyone wired up a "render the LP, scrape the prices, diff against the database seed" check that actually works? Or is everyone relying on "marketing reads engineering's PRs," which is the implicit policy that quietly fails the moment marketing copy is in a separate repo?&lt;/p&gt;

&lt;p&gt;The deeper thing I haven't resolved: the LP isn't just numbers. It's also abstractions — phrases like "1 credit per task" that aren't literal claims but framings. Those drift even when the numbers are right. I don't have a CI check for "is this metaphor still true." The honest answer for now is "review the LP copy whenever the meter changes," which is a process, not a guarantee. Curious whether anyone has done better.&lt;/p&gt;

</description>
      <category>saas</category>
      <category>pricing</category>
      <category>stripe</category>
      <category>ai</category>
    </item>
    <item>
      <title>The unit you pass between agents is the architecture — Purple to Blue with the implementation diff</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Tue, 05 May 2026 04:04:37 +0000</pubDate>
      <link>https://dev.to/zoetaka38/the-unit-you-pass-between-agents-is-the-architecture-purple-to-blue-with-the-implementation-diff-57cc</link>
      <guid>https://dev.to/zoetaka38/the-unit-you-pass-between-agents-is-the-architecture-purple-to-blue-with-the-implementation-diff-57cc</guid>
      <description>&lt;p&gt;We had a workflow where one agent (the orchestrator) would land an implementation, then hand the project off to a second agent (the QA one) to generate E2E tests. The QA agent did everything right by every prompt-engineering checklist. It read the codebase. It enumerated user flows. It produced clean Playwright scenarios with realistic selectors. The tests ran. They passed. They were green and beautiful and they were testing the wrong feature.&lt;/p&gt;

&lt;p&gt;Specifically: the orchestrator had just changed how the invitation signup form handles a new error state. The QA agent had no idea. It saw "invitation signup" in the task title, read the codebase from scratch, found the most prominent invitation flow (the one we'd shipped two months earlier), and wrote a thorough test for that. The new error state — the actual delta — was untouched.&lt;/p&gt;

&lt;p&gt;The fix is embarrassingly small in retrospect. The orchestrator already knew what it had changed; it just wasn't telling the QA agent. We started passing the git diff. Test scope tightened immediately.&lt;/p&gt;

&lt;p&gt;I'm building an AI dev harness called Codens; the relevant context here is that the orchestrator agent (Purple) and the QA agent (Blue) are separate services with separate Celery workers, separate prompts, and separate Claude Agent SDK contexts. They communicate over an internal HTTP API. The implementation-diff handoff is one HTTP field, but it's the field that decides whether Blue's test generation is precise or a confident guess.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why agents need the diff handed to them
&lt;/h2&gt;

&lt;p&gt;The naïve mental model — the one I had for longer than I should have — is that any agent with codebase access can figure out "what changed" by itself. &lt;code&gt;git log -1&lt;/code&gt;, &lt;code&gt;git diff HEAD~1&lt;/code&gt;, read the files. The information is right there. Why should the orchestrator pre-compute it?&lt;/p&gt;

&lt;p&gt;The reasons fall into three buckets, and they're not obvious until you watch a generation go off the rails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The QA agent has the wrong frame.&lt;/strong&gt; Blue's prompt is built around "generate test scenarios for &lt;em&gt;this feature&lt;/em&gt;." Without a diff, "feature" gets inferred from the task goal text — usually a sentence pulled from the Notion ticket. "Add error handling to invitation signup" expands in the agent's working memory to "the invitation signup feature," and the agent tests the whole feature. The diff would have collapsed that frame: it's not the feature, it's the four lines that changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subtle changes get lost.&lt;/strong&gt; Some implementation changes don't show up in the file structure or the route table. A condition tightened from &lt;code&gt;&amp;gt;&lt;/code&gt; to &lt;code&gt;&amp;gt;=&lt;/code&gt;. A default flipped from &lt;code&gt;true&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt;. A new field added to a request schema that's only validated server-side. An agent walking the codebase from scratch will read the changed file but has no reason to focus on those lines specifically — they look like the rest of the file. The diff's job is to point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test scope drifts toward the user-facing.&lt;/strong&gt; Agents enumerating scenarios from a feature description gravitate toward end-to-end happy paths because those read like "what a user does." But the change might be a backend-only refactor that doesn't change any user-visible behavior — its tests should be assertion-heavy at a different layer. The diff is the only signal that tells the QA agent "the change isn't what the user does, it's how this function rejects malformed input now."&lt;/p&gt;

&lt;p&gt;You can mitigate all of this by writing a sharper task description. We tried. Task descriptions kept being the things QA tickets are written as: a goal, a user, a desired outcome. They are by design about features, because they're about what the human asked for. The diff is the thing that says what the agent actually did with that ask.&lt;/p&gt;

&lt;h2&gt;
  
  
  The handoff: Purple to Blue, over HTTP
&lt;/h2&gt;

&lt;p&gt;The wire format is plain. Purple has the diff sitting in the working tree after its develop step (it just made the changes). It serializes the diff as a string and POSTs it to one of Blue's two internal test-generation endpoints. Blue's request schema declares the field as optional:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GenerateE2ETestsFromTaskRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ... task_goal, target_url, acceptance_criteria, test_count, auth_config ...
&lt;/span&gt;    &lt;span class="n"&gt;code_diff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Git diff of the implementation changes (git diff &amp;lt;base&amp;gt;...HEAD). &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;When provided, Claude uses the actual code changes to generate more &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;precise test scenarios.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two paths consume that field. The fast path is scenario extraction — Blue calls Claude once with the task goal and the diff, and gets back a list of scenario strings. The slow path is the exploratory agent — a multi-phase Claude Agent SDK session that walks the codebase with an MCP server, then drives a Playwright browser, then synthesizes tests. Both paths needed to learn the diff.&lt;/p&gt;

&lt;p&gt;The fast path's prompt change is the boring one, and that's the point. In &lt;code&gt;claude_client.py&lt;/code&gt; the user prompt got a single conditional block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;diff_section&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;code_diff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;diff_section&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;実装差分（この実装内容を参照してテストシナリオを生成してください）：&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
diff&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;code_diff&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
```&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;タスクゴール：
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_goal&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;diff_section&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
上記のタスクゴール&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;と実装差分&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;code_diff&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;から最大&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;個のE2Eテストシナリオを抽出してください。
各シナリオは「- 」で始まる形式で1行ずつ出力してください。&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(The Japanese is incidental — Blue's QA prompts are localized for our primary use case. The structure is what matters: append the diff to the user prompt as a fenced&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
 block, mention it in the closing instruction sentence, leave the system prompt unchanged.)

The system prompt got one new bullet:



```plaintext
4. 実装コードが提供されている場合は、実際に実装されたエンドポイントやUIパスを優先してテストする
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Roughly: "if implementation code is provided, prioritize testing the endpoints and UI paths that were actually implemented." That's the steering signal. Without it, Claude reads the diff but treats it as supplementary context — interesting but not authoritative. The bullet flips the priority: the diff is the source of truth about what to test, the task goal is the source of truth about &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The exploratory path is more interesting because the diff lands in a different phase. The exploratory agent runs three phases — code analysis, browser discovery, test synthesis — and only the first phase needs the diff. The agent client gets the diff in its constructor and stitches it into the code-analysis prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;diff_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_code_diff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;diff_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;IMPLEMENTATION DIFF (focus your analysis on these changed files):&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
diff&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_code_diff&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
```&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Analyze this codebase for E2E testing.

Exploration goal: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exploration_goal&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Focus areas: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;focus_areas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;focus_areas&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;All features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;diff_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Start by getting the file structure, then read key page/component files.
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pay special attention to the files and endpoints shown in the diff above.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_code_diff&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "pay special attention" sentence is doing real work. The exploratory agent has tools — it'll happily spend ten turns reading files that have nothing to do with the change. With the diff and the steering sentence, the analysis phase concentrates its file-reading budget on the changed files, which means the browser-discovery phase that follows knows which UI paths to actually exercise.&lt;/p&gt;

&lt;p&gt;The endpoint that fronts the exploratory path is a Celery launcher — POST returns a 202 with a task ID, and the worker runs for up to thirty minutes. The diff threads through the Celery kwargs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_exploratory_e2e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;project_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proj_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exploration_goal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exploration_goal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# ... browser config, viewport, timeout, auth ...
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_diff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code_diff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;internal_exploratory_e2e_task_submitted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;proj_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proj_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;has_diff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code_diff&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;has_diff&lt;/code&gt; log line is in there for a reason I'll come back to under pitfalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed in the test output
&lt;/h2&gt;

&lt;p&gt;I want to be careful about what I claim here. We don't have a controlled measurement of test quality before and after — running the same task through both versions is hard when the orchestrator's behavior depends on the codebase state, and "test quality" is itself a contested metric. So I'll stick to qualitative changes I can point to in actual generated tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope tightened.&lt;/strong&gt; Scenario lists got shorter and more specific. Before: "Test invitation signup happy path, test invitation signup with invalid email, test invitation signup as existing user, test resend invitation email." After (same task): "Test invitation signup with the new server-rejected error showing the inline message, test invitation signup retry after the error clears." The first list was four scenarios that all happened to mention invitation signup. The second was two scenarios that exercised the actual change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Selectors leaned on changed UI.&lt;/strong&gt; In the exploratory path, the browser-discovery phase was the noisier part — the agent would click through ten elements before settling on what to test. With the diff, it tended to navigate straight to the changed component. Both versions ended up at correct selectors, but synthesis had cleaner exploration history to draw from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subtle changes started getting tests.&lt;/strong&gt; The &lt;code&gt;&amp;gt;&lt;/code&gt; to &lt;code&gt;&amp;gt;=&lt;/code&gt; class of change. Before, these almost always produced a "test feature X works correctly" scenario that didn't probe the boundary specifically. After, the diff made the boundary visible and tests started exercising the exact value where the condition flipped.&lt;/p&gt;

&lt;p&gt;What didn't change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality of the test itself.&lt;/strong&gt; The Playwright code that came out the back end was about as good as before — selectors, waits, assertions. The diff doesn't help Claude write better Playwright; it helps Claude pick what to write Playwright &lt;em&gt;for&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance on large diffs.&lt;/strong&gt; A refactor diff that touches forty files and ten thousand lines is mostly noise from the agent's perspective. It's hard to focus on "the change" when the change is everywhere. The agent doesn't gracefully fall back to feature-level testing in that case — it kind of muddles through. Honest answer: large diffs are still hard, and we don't have a great solution beyond "split the task smaller upstream."&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfalls along the way
&lt;/h2&gt;

&lt;p&gt;Things that bit us, in roughly the order they bit:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diff format choice.&lt;/strong&gt; The first version of this passed &lt;code&gt;git diff --name-only&lt;/code&gt; — just the file list. The reasoning was "it's compact, the agent can read the files itself." The reasoning was wrong: the agent could read the files but still had no signal about &lt;em&gt;which lines&lt;/em&gt; changed. We switched to full unified diff (&lt;code&gt;git diff &amp;lt;base&amp;gt;...HEAD&lt;/code&gt;). Tests got better. Tokens went up; we'll come back to that. We also considered &lt;code&gt;git diff --stat&lt;/code&gt; as a middle ground but the line counts are a weak signal and the agent didn't use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context window economy.&lt;/strong&gt; A unified diff that's two thousand lines isn't free. Blue's scenario-extraction Claude call has a 1024-token max output but the input includes the system prompt, task goal, and diff — and the diff is the variable. We watched a few large-refactor tasks hit the context limit and either truncate or fail. Current behavior is "send the whole diff and hope," which works for typical task-sized changes (tens to hundreds of changed lines) and fails ungracefully for refactors. The cleaner solution is something like "summarize the diff before sending if it exceeds a threshold," but summarizing a diff is itself a Claude call and we haven't done it. The crude mitigation is that Purple's task graph tries to keep tasks small, which makes diffs small as a side effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purple commit timing vs Blue trigger timing.&lt;/strong&gt; This is the subtle one. The diff Purple sends to Blue is computed at trigger time, not commit time. If Purple commits, then does another step, then triggers Blue, the diff Blue sees has to either be (a) recomputed at trigger time from the commit history, or (b) cached from the develop step. We started with (b) — store the diff right after develop, pass it through the workflow context to the trigger step. That worked until a step in between modified files (e.g., a follow-up cleanup step). The cached diff was now stale. Switching to (a) — recompute from &lt;code&gt;git diff &amp;lt;base&amp;gt;...HEAD&lt;/code&gt; at trigger time, where &lt;code&gt;&amp;lt;base&amp;gt;&lt;/code&gt; is the parent of the develop commit — fixed it. The lesson: the diff is a function of two refs in the workflow, not a snapshot. Treat it as a query, not a value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "did the diff actually arrive" question.&lt;/strong&gt; When tests came back wrong, we needed to know whether the diff failed to send or whether the diff was sent and the agent ignored it. That's why the structured log line in the task launch logs &lt;code&gt;has_diff&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;internal_exploratory_e2e_task_submitted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;proj_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proj_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;has_diff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code_diff&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a one-bit signal but it cuts the diagnosis space in half. Without it, every "the test scope was wrong" investigation started with a five-minute trace through Purple's HTTP client to figure out what payload it actually sent. With it, you check the log and either move on (diff was there, prompt issue) or look upstream (diff wasn't there, send issue).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backward compatibility.&lt;/strong&gt; The endpoint had to keep working for callers that don't send the diff — there are integration-test paths and direct-API consumers that hit it without going through Purple. So the field is &lt;code&gt;Optional[str]&lt;/code&gt; and the prompt-construction code conditions every diff-related sentence on &lt;code&gt;if self._code_diff&lt;/code&gt;. The unit tests cover both shapes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_generate_accepts_code_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;generate endpoint accepts code_diff and passes it to ClaudeClient.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# ... mock setup ...
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TestClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/v1/internal/projects/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;proj_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/e2e-tests/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_goal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Test login&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_diff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Internal-Api-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;
    &lt;span class="n"&gt;call_kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;mock_claude&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate_test_scenarios_from_task_goal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;call_kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_diff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_generate_without_code_diff_still_works&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;generate endpoint works when code_diff is omitted (backward compat).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# ... mock setup, no code_diff in body ...
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "without_code_diff" test is the cheap insurance we needed because the prompt-construction code is full of conditional inserts and it would be very easy to break the no-diff path with an f-string mistake.&lt;/p&gt;

&lt;h2&gt;
  
  
  The general principle
&lt;/h2&gt;

&lt;p&gt;The thing I keep coming back to: in a multi-agent system, &lt;strong&gt;the substance you pass between agents is more architecturally consequential than the prompt to any single agent.&lt;/strong&gt; I had spent months tuning Blue's test-generation prompt — adding constraints, refining output format, picking better examples. None of that prompt-tuning closed the gap that one HTTP field closed in a day.&lt;/p&gt;

&lt;p&gt;The reason is structural. Blue's prompt is a function — given inputs, produce tests. The quality ceiling of that function is bounded by the inputs. If the inputs don't include "what changed," no amount of prompt cleverness will produce tests that are about the change, because the information isn't in the function's domain. The agent will find &lt;em&gt;something&lt;/em&gt; to test, and it will test that something well, but the something won't be the change.&lt;/p&gt;

&lt;p&gt;This generalizes. In any pipeline where one agent's output feeds another's input, the question "what is the unit of work I'm passing between agents" is the architecture question. The candidates I've seen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The task description.&lt;/strong&gt; "Add error handling to invitation signup." Coarse. Loses information about implementation details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The diff.&lt;/strong&gt; What we landed on for Purple to Blue. Captures what changed; doesn't capture intent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diff plus task description.&lt;/strong&gt; What Purple actually sends. Diff is what changed, task is why. Both are needed; diff alone makes good tests for the wrong reason, task alone makes plausible tests for the wrong feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A structured task spec.&lt;/strong&gt; Task ID, acceptance criteria, file list, diff, test commands, auth config. We pass this for some agent transitions where Blue needs more than prose context. Heavier; harder to construct; easier to consume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The conversation log.&lt;/strong&gt; Some teams pass the previous agent's full chain-of-thought. Has the most information; has way too much noise; the receiving agent's context window pays for it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each option is a tradeoff between completeness and signal-to-noise. The diff was a sweet spot for the test-generation handoff specifically because (a) it's compact relative to a full conversation log, (b) it's structured enough that Claude reliably parses it, and (c) the information it carries — &lt;em&gt;which lines of which files changed&lt;/em&gt; — was exactly the signal Blue was missing. For other handoffs the right unit is different. The PRD-to-task-graph handoff in green-codens passes a structured spec; the bug-to-fix handoff in red-codens passes a stack trace plus repro. There's no universal answer; the question is "what does the receiver need to do its job, and what's the minimum form that carries it."&lt;/p&gt;

&lt;p&gt;The mental shift, for me, was from "let me make each agent better" to "let me make the channels between agents richer." The agents were already capable. What they needed was less ambiguous inputs. Once you frame the system as a graph of agents with edges that carry typed payloads, the design moves are about the edges — what gets passed, in what format, with what guarantees about freshness — not about the nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If you're shipping a multi-agent system, I'm curious what unit you've landed on for the inter-agent payload. Is it a task description, a structured spec, a diff, a full conversation log, something else? What was the failure mode that pushed you toward that specific choice — was it scope drift like ours, or context-window economics, or something on the receiving agent's reliability? And once you'd picked it, what was the next thing that broke that you wish you'd designed for from the start?&lt;/p&gt;

&lt;p&gt;The thing I haven't resolved yet, and would love to hear about: how do you handle the case where the diff is too large to send? Our current "split tasks smaller upstream" answer pushes the problem onto Purple's planning agent, but that's a constraint on planning, not a solution for QA. A "summarize the diff" pre-step is the natural next move, but compressing a diff loses the line-level precision that made the diff useful in the first place. Whoever has solved this for refactor-sized changes — I'd genuinely like to compare notes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>testing</category>
      <category>playwright</category>
    </item>
  </channel>
</rss>
