<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sergei Parfenov</title>
    <description>The latest articles on DEV Community by Sergei Parfenov (@p0rt).</description>
    <link>https://dev.to/p0rt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F157612%2F7bffbdb1-8c3d-47ed-9575-80cee50d05f2.jpeg</url>
      <title>DEV Community: Sergei Parfenov</title>
      <link>https://dev.to/p0rt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/p0rt"/>
    <language>en</language>
    <item>
      <title>The Most Powerful Model on the Market Got Pulled by the Government in 3 Days. Is It Real, or a Hype Bubble?</title>
      <dc:creator>Sergei Parfenov</dc:creator>
      <pubDate>Sat, 13 Jun 2026 14:46:35 +0000</pubDate>
      <link>https://dev.to/p0rt/the-most-powerful-model-on-the-market-got-pulled-by-the-government-in-3-days-is-it-real-or-a-hype-fce</link>
      <guid>https://dev.to/p0rt/the-most-powerful-model-on-the-market-got-pulled-by-the-government-in-3-days-is-it-real-or-a-hype-fce</guid>
      <description>&lt;p&gt;The timing is almost too clean to be real.&lt;/p&gt;

&lt;p&gt;On June 9, Anthropic shipped &lt;strong&gt;Claude Fable 5&lt;/strong&gt; — a "Mythos-class" model they described as more capable than anything they'd previously made generally available. Three days later, on June 12, the US Commerce Department sent a letter to CEO Dario Amodei placing Fable 5 (and its restricted sibling Mythos 5) under export controls: no access for any location outside the US, and no access for foreign persons inside it.&lt;/p&gt;

&lt;p&gt;Anthropic couldn't filter non-US users from everyone else in real time. So they did the only thing they could: &lt;strong&gt;they killed the model for everyone, worldwide. Including US citizens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you opened a session this weekend and got &lt;em&gt;"there's an issue with the selected model (claude-fable-5)... you may not have access to it"&lt;/em&gt; — that's not your setup. The model your session pointed at was pulled. Your projects, history, and limits are untouched; only which model answers you changed. Switch to Opus 4.8 or Sonnet and you're back.&lt;/p&gt;

&lt;p&gt;Now the question worth actually thinking about: &lt;strong&gt;is this real, or is everyone inflating a bubble around a model nobody can even use right now?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The honest answer is &lt;em&gt;both&lt;/em&gt;, and the interesting part is separating the two.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's genuinely new here
&lt;/h2&gt;

&lt;p&gt;Strip the drama and there's a real precedent underneath.&lt;/p&gt;

&lt;p&gt;AI export controls, until this week, were about &lt;strong&gt;hardware&lt;/strong&gt;: chips, lithography machines, the physical supply chain. The chokepoint was always silicon. What just happened is different in kind — the government reached past the hardware and pulled a &lt;em&gt;deployed, commercial software model&lt;/em&gt; that hundreds of millions of people were already using.&lt;/p&gt;

&lt;p&gt;That's the part to file away. It means a frontier model is now being treated less like a product and more like a dual-use technology with an off-switch held by someone other than the vendor. If you build on these APIs, model availability is no longer just an SLA question or a "will the vendor deprecate it" question. It's a geopolitical dependency. That's a real shift in how you should think about resilience — treat your model provider like any critical supply-chain vendor, with a fallback path that doesn't assume the top model stays reachable.&lt;/p&gt;

&lt;p&gt;So: precedent — real. Worth tracking. Not hype.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the bubble is
&lt;/h2&gt;

&lt;p&gt;Here's where I think a lot of the coverage is doing unpaid marketing.&lt;/p&gt;

&lt;p&gt;The official justification is a &lt;strong&gt;jailbreak&lt;/strong&gt; — reportedly surfaced by another company and escalated to the government as a national-security concern. Anthropic's own response, which is the most useful document in this whole episode, says the quiet part plainly: the technique they were shown exposed a &lt;em&gt;small number of previously known, minor vulnerabilities&lt;/em&gt; — the kind that &lt;strong&gt;other publicly available models find without any jailbreak at all&lt;/strong&gt; (they name-check a competing GPT-class model). In other words, the "national security threat" rests on a narrow, non-universal exploit, not on some unique cliff-edge capability that only Fable 5 possesses.&lt;/p&gt;

&lt;p&gt;Now layer on the incentive structure. There is no better marketing in this industry than &lt;em&gt;"a model so powerful the government had to ban it."&lt;/em&gt; That sentence sells capability, sells the safety narrative ("we build things genuinely dangerous enough to be regulated"), and sells it for free, in every headline, with the government as an involuntary co-signer. The halo effect is enormous, and it maps perfectly onto a story the market already wants to believe and already prices into valuations.&lt;/p&gt;

&lt;p&gt;I'm not saying anyone engineered this. I'm saying notice how neatly a suspension you didn't choose reinforces the exact narrative that benefits you most.&lt;/p&gt;

&lt;h2&gt;
  
  
  So what's actually true?
&lt;/h2&gt;

&lt;p&gt;Let me be concrete, because vagueness is how bubbles survive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The capabilities are real.&lt;/strong&gt; Fable 5 is priced at $10 / $50 per million input/output tokens — roughly double Opus 4.8 — and counts as 2x usage on subscription plans. You don't price a model like that, or burn that much compute on it, for a phantom. There's a genuinely strong model here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The regulatory precedent is real.&lt;/strong&gt; First time a deployed commercial model has been pulled by export control. That changes the risk model for everyone shipping on top of these APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "existential / too-dangerous-to-exist" framing is mostly bubble.&lt;/strong&gt; It's assembled from one government's reaction to one narrow jailbreak, plus a halo that happens to be extremely convenient for the vendor. Anthropic itself is arguing the directive is a misunderstanding and that the exploit is neither unique nor severe — which is a strange thing to argue if you actually believed your model was a civilizational hazard.&lt;/p&gt;

&lt;p&gt;My read: hold both thoughts at once. &lt;strong&gt;The governance story is the real headline. The "scariest model ever" story is the one selling tickets.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do if you build on this
&lt;/h2&gt;

&lt;p&gt;Practical, not philosophical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't hard-code your default to a frontier model you don't control the availability of.&lt;/strong&gt; Set a fallback chain (Fable → Opus 4.8 → Sonnet) and make sure your app degrades, not breaks, when the top model vanishes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reserve the expensive model for the tasks that earn it&lt;/strong&gt; — long agentic runs, hard refactors, genuinely multi-step reasoning. At 2x cost and 2x usage, defaulting everything to the top tier is just lighting money on fire even when it &lt;em&gt;is&lt;/em&gt; available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat model availability as a supply-chain risk&lt;/strong&gt; in your architecture docs. This won't be the last time a model you depend on disappears for reasons that have nothing to do with you.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is gone for now. No firm return date — Anthropic says it's working to restore access and frames the whole thing as a misunderstanding. Until then, Opus 4.8 still does the job for the overwhelming majority of what any of us actually ship.&lt;/p&gt;

&lt;p&gt;The model left. The narrative is still here, doing its job.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>news</category>
    </item>
    <item>
      <title>You Fixed the Rate Limits. Now Your Agent Fails Quietly.</title>
      <dc:creator>Sergei Parfenov</dc:creator>
      <pubDate>Thu, 11 Jun 2026 16:58:21 +0000</pubDate>
      <link>https://dev.to/p0rt/you-fixed-the-rate-limits-now-your-agent-fails-quietly-3keo</link>
      <guid>https://dev.to/p0rt/you-fixed-the-rate-limits-now-your-agent-fails-quietly-3keo</guid>
      <description>&lt;p&gt;Last week I wrote that &lt;a href="https://dev.to/p0rt/your-ai-agent-isnt-failing-because-it-hallucinates-its-failing-because-of-rate-limits-2d60"&gt;your agent isn’t failing because it hallucinates — it’s failing because of rate limits&lt;/a&gt;. The capacity-engineering toolkit in that post — concurrency caps, backoff with jitter, fallback models, caching — is real and it works. Deploy it and your agent stops dying.&lt;/p&gt;

&lt;p&gt;Then a commenter (ANP2) pointed out the thing the post undersold, and it’s been stuck in my head since: &lt;strong&gt;every one of those fixes quietly opens a correctness hole while it closes the availability one.&lt;/strong&gt; This post is me paying that comment thread its due, because the second half of the story turns out to matter more than the first.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — A 429 is a &lt;em&gt;loud&lt;/em&gt; failure: you see it, you alert on it, you fix it. Retries, fallbacks, and caches keep the agent alive — but they let it act on output it didn’t freshly earn: a stale cache hit, a different model’s answer, a re-run side effect. You’ve traded loud failures for quiet ones. The fix is to treat &lt;strong&gt;availability&lt;/strong&gt; (“can I serve this?”) and &lt;strong&gt;correctness&lt;/strong&gt; (“can I still trust the result?”) as two separate gates — and to propagate trust &lt;em&gt;across the agent’s chain&lt;/em&gt;, not just per call.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The trade you didn’t know you made
&lt;/h2&gt;

&lt;p&gt;Here’s the uncomfortable symmetry. The whole point of my last post was that the dominant production failure mode isn’t the model being wrong — it’s the plumbing saying no. The capacity toolkit fixes the plumbing. But look at what each fix actually does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A retry&lt;/strong&gt; re-runs a call. If that call had a side effect — created a ticket, sent a message, committed a change — the retry runs the side effect &lt;em&gt;again&lt;/em&gt;. The agent didn’t fail; it succeeded twice, which is its own kind of wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A fallback model&lt;/strong&gt; answers when the primary is rate-limited. But it’s a different model: different training, different calibration, different failure modes. The task continues on an answer the primary never produced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A cache hit&lt;/strong&gt; serves a response generated for an earlier input. If the world moved — the codebase changed, the data updated — the cached answer can be subtly stale for &lt;em&gt;this&lt;/em&gt; request while looking perfectly fresh.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each mechanism keeps the agent &lt;strong&gt;up&lt;/strong&gt;. None of them guarantees the agent is &lt;strong&gt;right&lt;/strong&gt;. And the cruel part is the failure economics: the 429 you eliminated was honest — visible, countable, alertable. The failures you bought instead are silent. The agent stays up and is confidently wrong, which is exactly the failure mode the hallucination-hunters were worried about in the first place — just arriving through the plumbing instead of the model.&lt;/p&gt;

&lt;p&gt;The reliability you bought is &lt;strong&gt;uptime, not correct uptime&lt;/strong&gt;. (That phrase is ANP2’s, and it’s better than anything in my original post.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Two gates, not one
&lt;/h2&gt;

&lt;p&gt;The conversation in that thread converged on a framing I now use everywhere: an agent’s runtime layer has to answer two different questions, and conflating them is where the quiet failures breed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate 1 — “Can I serve this?”&lt;/strong&gt; This is the availability gate. Trip the fallback on 429s, serve the cache on a hit, retry on transient errors. Another commenter (Echo) nailed the key property of this gate: when you trip a fallback &lt;em&gt;only on rate-limit errors&lt;/em&gt; — never on bad outputs — the failure mode you’ve introduced is &lt;strong&gt;latency, not quality&lt;/strong&gt;. The fallback just buys time. That’s a fine trade, and it’s why the capacity toolkit is still the right first move.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate 2 — “Can I act on this irreversibly?”&lt;/strong&gt; This is the correctness gate, and it’s where the degraded outputs from Gate 1 must get re-examined. The moment an output is about to feed something you can’t take back — a merge, a payment, a message to a user, a deleted record — its &lt;em&gt;provenance&lt;/em&gt; matters. Did it come from the primary, fresh? Or from a fallback, a cache, a retry?&lt;/p&gt;

&lt;p&gt;One rule worth stealing here: &lt;strong&gt;gate on risk, not on confidence.&lt;/strong&gt; There’s a war story making the rounds of an agent that was 95% confident about a production database migration — the missing 5% was a foreign-key constraint absent from its test data, and the only thing that prevented corrupted referential integrity across three tables was a hard rule that destructive operations always require human approval, &lt;em&gt;regardless of confidence&lt;/em&gt;. Confidence is the model grading itself; irreversibility is a property of the action. Gate on the second.&lt;/p&gt;

&lt;p&gt;The two gates fail differently, and that’s the point: Gate 1 failures cost you time; Gate 2 failures cost you trust. A system with only Gate 1 is fast and quietly dangerous. A system with only Gate 2 is safe and constantly down. You need both, and they need to stay separate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-call correctness: the three tags
&lt;/h2&gt;

&lt;p&gt;The minimum viable version of Gate 2 is making degraded outputs &lt;em&gt;identifiable&lt;/em&gt;. Three mechanisms, one per capacity fix:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Idempotency keys on anything with side effects.&lt;/strong&gt; Before an agent action that touches the world, generate a key from the task + step + inputs. The receiving system deduplicates on it. Now a retry is safe by construction — the second execution is a no-op instead of a double-fire. This is decades-old distributed-systems practice; agent frameworks have mostly just… not adopted it yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# pass it with the side-effecting call; the receiver dedupes on it
&lt;/span&gt;&lt;span class="nf"&gt;create_ticket&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The grown-up version of this is the &lt;strong&gt;saga pattern&lt;/strong&gt; from distributed systems: each step records its completion and defines a compensation action, so a task that dies at step 4 of 7 can roll back cleanly instead of orphaning state. Idempotency prevents duplicate effects; sagas handle partial completion. Once your agents fail mid-workflow — and they will — you eventually want both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Trust tags on fallback outputs.&lt;/strong&gt; When the fallback answers instead of the primary, don’t just return the text — return &lt;code&gt;(text, trust="degraded")&lt;/code&gt;. Cheap to add, and it’s the hook everything downstream needs. A degraded answer is fine for the agent to &lt;em&gt;keep thinking with&lt;/em&gt;; it is not fine to &lt;em&gt;act irreversibly on&lt;/em&gt; without a re-check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Validity conditions on cache entries.&lt;/strong&gt; A cache entry shouldn’t just store the response — it should store what the response &lt;em&gt;assumed&lt;/em&gt;: which file version, which data snapshot, which config. On a hit, check the assumptions, not just the key. If the codebase moved since the entry was written, that’s a miss wearing a hit’s clothes. And the assumptions can move without you touching anything: providers silently update models, document stores drift, input distributions shift — degradation with no error to catch. Your “primary, fresh” answer from last Tuesday may already be a fallback in disguise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part single calls don’t prepare you for: trust must propagate
&lt;/h2&gt;

&lt;p&gt;Here’s where agents make this genuinely harder than classic distributed systems, and it’s the piece I’d add on top of the thread that started this post.&lt;/p&gt;

&lt;p&gt;Say step 3 of a 6-step task came from a lower-trust fallback. Steps 4, 5, and 6 each run on the primary, fresh, individually flawless. Are they trustworthy?&lt;/p&gt;

&lt;p&gt;No — and this is the trap. &lt;strong&gt;They reasoned on top of a degraded input.&lt;/strong&gt; This isn’t a niche concern, either: observability vendors who cluster production agent traces report that &lt;em&gt;chained corruption&lt;/em&gt; — one bad step at position N silently poisoning everything after it — is the single most common and most insidious agent failure mode they see. And the math is brutal: at a 95% per-step success rate, an 8-step task completes cleanly ~66% of the time; at 85% per step, it’s ~27%. The chain is where reliability goes to die, quietly. Each step is locally correct and the trajectory is still poisoned. If the trust tag stays local to the call that produced it, the degraded answer launders itself: two “clean” hops later it looks pristine, and your irreversibility gate at step 6 checks the last call’s tag, sees green, and fires.&lt;/p&gt;

&lt;p&gt;So the tag can’t be per-call metadata. It has to &lt;strong&gt;taint&lt;/strong&gt; — propagate to everything downstream of it, the way taint-tracking works in security analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StepResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;trust&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;          &lt;span class="c1"&gt;# "full" | "degraded"
&lt;/span&gt;    &lt;span class="n"&gt;tainted_by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# which upstream steps were degraded
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;propagate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;StepResult&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;my_trust&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="n"&gt;taint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tainted_by&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;taint&lt;/span&gt; &lt;span class="o"&gt;|=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step_id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trust&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;degraded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;# my own trust can't exceed the weakest input
&lt;/span&gt;    &lt;span class="n"&gt;trust&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;degraded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;taint&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;my_trust&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;degraded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;trust&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taint&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the irreversibility gate checks the &lt;strong&gt;aggregate trust of the whole trajectory&lt;/strong&gt;, not the last hop: if anything upstream was degraded and unverified, the action pauses for a re-check — re-run the degraded step on the primary, or escalate to a human. In my experience the re-check fires rarely; the point isn’t that fallbacks are usually wrong, it’s that the one time the degraded path feeds a merge or a payment, you want it caught at the gate instead of in the incident review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making it observable (or it didn’t happen)
&lt;/h2&gt;

&lt;p&gt;Same lesson as the capacity post, one level up. You can’t engineer what you can’t see, and correctness debt is even quieter than 429s. The minimum dashboard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;% of completed tasks with any degraded step&lt;/strong&gt; — your real exposure, invisible in error rates because nothing errored.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;% of irreversible actions that fired with taint&lt;/strong&gt; — should be ~zero; every one is a gate you skipped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache validity-miss rate&lt;/strong&gt; — hits that failed the assumption check. If this is zero, you’re probably not checking assumptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fallback divergence&lt;/strong&gt; — periodically replay fallback-answered requests on the primary and diff. This is your measured answer to “how different is the fallback, actually?” instead of a vibe.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these show up in uptime. All of them are the difference between uptime and correct uptime.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;The capacity toolkit from the last post is still step one — an agent that’s down helps nobody. But availability engineering has a hidden invoice: every mechanism that keeps the agent alive does it by substituting something for the fresh, primary, verified answer. That substitution is usually fine — which is exactly what makes it dangerous, because “usually fine” plus “irreversible” plus “silent” is how you get the 3am incident that no alert predicted.&lt;/p&gt;

&lt;p&gt;Two gates. Tag what’s degraded. Taint what it touches. Check the trajectory, not the last call, before anything you can’t undo.&lt;/p&gt;

&lt;p&gt;Uptime is table stakes. Correct uptime is the product.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sources &amp;amp; further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://latitude.so/blog/ai-agent-failure-detection-guide" rel="noopener noreferrer"&gt;Detecting AI Agent Failure Modes in Production&lt;/a&gt;, Latitude (2026) — chained corruption as the most common and most insidious production failure mode.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.jztan.com/ai-agent-error-handling-patterns/" rel="noopener noreferrer"&gt;AI Agent Error Handling: 5 Patterns to Catch Silent Failures&lt;/a&gt;, Kevin Tan (2026) — the saga pattern, the 95%-confident migration story, and risk-based escalation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.trantorinc.com/blog/ai-agent-failure-modes-what-goes-wrong-design-resilience" rel="noopener noreferrer"&gt;AI Agent Failure Modes: What Goes Wrong in Production&lt;/a&gt;, Trantor (2026) — silent quality degradation from provider model updates and store drift.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2602.21012" rel="noopener noreferrer"&gt;International AI Safety Report 2026&lt;/a&gt; — why agent failures are categorically riskier: actions in the world, no human in the loop.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/p0rt/your-ai-agent-isnt-failing-because-it-hallucinates-its-failing-because-of-rate-limits-2d60"&gt;My previous post on the capacity side&lt;/a&gt; — the availability toolkit this post is the second half of.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Credit where due: this post exists because ANP2 and Echo took the last one apart constructively in the comments — the “uptime, not correct uptime” framing and the latency-not-quality fallback distinction are theirs. Best argument I’ve had on this site. If you’re running agents in prod: do you track degraded-path exposure at all, or does your observability stop at error rates? Genuinely curious how rare Gate 2 is in the wild.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Comments Got Good. That's How I Knew.</title>
      <dc:creator>Sergei Parfenov</dc:creator>
      <pubDate>Thu, 04 Jun 2026 14:09:06 +0000</pubDate>
      <link>https://dev.to/p0rt/the-comments-got-good-thats-how-i-knew-42m9</link>
      <guid>https://dev.to/p0rt/the-comments-got-good-thats-how-i-knew-42m9</guid>
      <description>&lt;p&gt;&lt;em&gt;I wrote a post about model distillation. The comments were thoughtful, specific, technically sharp — and that's exactly what made me check whether any of them were written by people.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🧪 Everything here — the scraper, the detector, the simulation, the figures — is reproducible: &lt;strong&gt;&lt;a href="https://github.com/P0rt/the_cozy_web" rel="noopener noreferrer"&gt;github.com/P0rt/the_cozy_web&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;A few weeks ago I published &lt;a href="https://dev.to/p0rt/how-model-distillation-actually-works-and-what-the-china-distilled-our-model-headlines-really-3o0o"&gt;a post on how model distillation actually works&lt;/a&gt;. It did fine — 35 reactions, 14 comments. And the comments were &lt;em&gt;great&lt;/em&gt;. Not "great post, thanks for sharing" great. &lt;strong&gt;Substantively&lt;/strong&gt; great. People pushed back on my "the student is bounded by the teacher" claim with a real counter-example. Someone reframed distillation as "a forcing function for what you actually need." Someone dropped a paper recommendation. Someone shared a 20× cost number from production.&lt;/p&gt;

&lt;p&gt;I should have felt good. Instead I felt the thing you feel when a stranger knows your name. Something was off, and it took me a day to articulate what: &lt;strong&gt;the comments were too well-adapted.&lt;/strong&gt; Every one of them did the same three things in the same order, like they'd all read the same playbook. And a suspicious number of the accounts were two weeks old, or named after a product, or both.&lt;/p&gt;

&lt;p&gt;So I did what I do. I pulled the data. This is what I found, why I now think a real chunk of "engagement" on dev blogs is machine-generated or machine-shaped, and — because I don't trust my own pattern-matching — what the actual peer-reviewed research says about whether you can even tell anymore.&lt;/p&gt;




&lt;h2&gt;
  
  
  "Great post!" is dead. Meet the eco-comment.
&lt;/h2&gt;

&lt;p&gt;The old bot comment was easy. "Nice article, very informative, looking forward to more!" You could smell it. Anyone could.&lt;/p&gt;

&lt;p&gt;That's not what's under my posts anymore. The new thing is &lt;em&gt;substantive&lt;/em&gt; and &lt;strong&gt;ecological&lt;/strong&gt; — it adds real value, it's polite, it never picks a real fight, and it leaves the thread feeling cozier than before. Here's the actual skeleton, which I only saw once I'd read fourteen of them back to back:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Validate a specific phrase from the post.&lt;/strong&gt; Not generic praise — they quote &lt;em&gt;your&lt;/em&gt; framing back at you. "The 'separate the engineering from the geopolitics' framing is the public service here."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add one piece of genuine nuance.&lt;/strong&gt; "One thing I'd add…" "The part worth amplifying for builders…" Often a real, correct technical point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drop a first-person-plural anecdote with a number, naming a product.&lt;/strong&gt; "We use [model X] as our daily driver and the cost difference is roughly 20×." "When working with [our GPU product], we've seen…"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never, ever, actually disagree.&lt;/strong&gt; Even the "corrections" are framed so gently that I — the author — instantly conceded.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Read one, it's a great comment. Read eight, it's a &lt;strong&gt;template&lt;/strong&gt;. And step 3 is the tell: the technical substance isn't the point. It's the &lt;em&gt;wrapper&lt;/em&gt; around a product mention, engineered to be useful enough to clear a spam filter and an AI detector both.&lt;/p&gt;




&lt;h2&gt;
  
  
  My own thread, by the numbers
&lt;/h2&gt;

&lt;p&gt;I scraped my article's comments straight from the dev.to public API and ran them through two things: a detector I'd built earlier for the &lt;em&gt;old&lt;/em&gt; "Great post!" style, and a set of new structural signals. (&lt;a href="https://github.com/P0rt/the_cozy_web/blob/main/analyze_devto.py" rel="noopener noreferrer"&gt;&lt;code&gt;analyze_devto.py&lt;/code&gt;&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My old detector shrugged.&lt;/strong&gt; On the eight non-me comments it gave a mean "coziness" score of &lt;strong&gt;0.25&lt;/strong&gt; — i.e. it confidently waved them through as human. Of course it did: it was built to catch clichés, em-dashes, and uniform positivity, and these comments are armored with exactly the thing that defeats it — real specifics.&lt;/p&gt;

&lt;p&gt;The new signals told a different story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;product/company plug:              4 / 8 comments
opens by validating a phrase:      5 / 8 comments
comments that genuinely push back: 2 / 8   (and I conceded both, instantly)
auto-generated-looking username:   1   (a random-hex handle, 0 posts, "Thank you for this!")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I looked at &lt;em&gt;who&lt;/em&gt; was commenting. Public profiles, public join dates. I'm going to describe the patterns rather than pillory individuals — but the shapes were loud:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;An account literally named after a product&lt;/strong&gt; ("Sealed GPUs. Private AI."), whose comment plugs that product. That one isn't a person; it's a brand broadcasting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A two-week-old persona account&lt;/strong&gt; — created days before my post — that plugs two named tools and somehow published five articles in its first fortnight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A throwaway&lt;/strong&gt; with a random-hex username, zero posts, and a one-line "Thank you for this!"&lt;/li&gt;
&lt;li&gt;A couple that &lt;strong&gt;look more human&lt;/strong&gt; — real names, older accounts — but still run the exact template and still ship a startup plug.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To be fair and clear: &lt;strong&gt;I can't prove any single one of these is a bot.&lt;/strong&gt; Some are probably real people running their comments through an assistant. But that distinction matters less than it sounds, and I'll come back to why.&lt;/p&gt;




&lt;h2&gt;
  
  
  Is it just me? I swept 38 other posts.
&lt;/h2&gt;

&lt;p&gt;A pattern on one thread is an anecdote. So I pulled comments across 38 popular dev.to articles in &lt;code&gt;ai&lt;/code&gt;, &lt;code&gt;machinelearning&lt;/code&gt;, &lt;code&gt;webdev&lt;/code&gt;, and &lt;code&gt;programming&lt;/code&gt; — &lt;strong&gt;1,366 comments from 346 accounts&lt;/strong&gt; (&lt;a href="https://github.com/P0rt/the_cozy_web/blob/main/sweep_devto.py" rel="noopener noreferrer"&gt;&lt;code&gt;sweep_devto.py&lt;/code&gt;&lt;/a&gt;) — and looked for the same fingerprint.&lt;/p&gt;

&lt;p&gt;Two findings made the hair on my neck stand up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A handful of accounts spray the same template across dozens of unrelated posts.&lt;/strong&gt; The most prolific commenters in my sample showed up on &lt;strong&gt;14–22 distinct articles each&lt;/strong&gt; — several of them the same accounts that had appeared on my own thread, several of them flagged for product plugs. A human who loved your distillation post might also comment on three others. They don't leave structurally-identical "validate → nuance → we-at-Product → number" comments on &lt;em&gt;fourteen&lt;/em&gt; different articles in a couple of weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Different "people" reuse the same connective tissue.&lt;/strong&gt; I counted 4-grams that appear across &lt;em&gt;distinct&lt;/em&gt; accounts. Humans almost never echo each other's exact phrasing. These did:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x13 distinct accounts:  "exactly the kind of"
 x8 distinct accounts:  "is exactly the kind"
 x7 distinct accounts:  "this is exactly the"
 x6 distinct accounts:  "is the part that"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"This is exactly the kind of thing that…" is a &lt;em&gt;generative&lt;/em&gt; construction — it's how an LLM hedges into a confident-sounding addition. Thirteen different strangers don't independently converge on it. One model behind thirteen masks does.&lt;/p&gt;

&lt;p&gt;Across the whole sweep, 11 accounts left long product plugs, 32 opened with phrase-validation, and 4 ran the full skeleton. It's not my imagination, and it's not just my post. It's the ambient texture of the platform now.&lt;/p&gt;




&lt;h2&gt;
  
  
  I'd been calling this the wrong thing
&lt;/h2&gt;

&lt;p&gt;I went in thinking "bots." What I'd actually walked into is two older ideas fusing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dead Internet Theory&lt;/strong&gt; — the half-joke that the web "died" and is now mostly bots and generated text talking to itself — has stopped being a joke. Hal Berghel makes the serious version of the case in &lt;em&gt;IEEE Computer&lt;/em&gt; (&lt;a href="https://doi.org/10.1109/MC.2025.3616665" rel="noopener noreferrer"&gt;"Generative AI Is Breathing New Life Into the Dead Internet Theory"&lt;/a&gt;, 2026): strip the conspiracy, and the lean core — synthetic content drowning out and being mistaken for humans — just &lt;em&gt;converges with what's measurable&lt;/em&gt;. Imperva clocked &lt;a href="https://www.imperva.com/blog/2025-imperva-bad-bot-report-how-ai-is-supercharging-the-bot-threat/" rel="noopener noreferrer"&gt;automated traffic at 51% of the web in 2024&lt;/a&gt;, the first time bots crossed half. Even Sam Altman &lt;a href="https://time.com/7316046/sam-altman-dead-internet-theory/" rel="noopener noreferrer"&gt;said it out loud&lt;/a&gt;: the wave of AI activity makes dead-internet theory feel real.&lt;/p&gt;

&lt;p&gt;The other half is the &lt;strong&gt;Cozy Web&lt;/strong&gt;. Venkatesh Rao coined the term; Maggie Appleton &lt;a href="https://maggieappleton.com/cozy-web" rel="noopener noreferrer"&gt;diagrammed it&lt;/a&gt; alongside Yancey Strickler's "dark forest": humans fleeing the bot-infested public square into private rooms — group chats, Discords, DMs. Appleton's follow-up, &lt;a href="https://maggieappleton.com/forest-talk" rel="noopener noreferrer"&gt;"The Expanding Dark Forest and Generative AI"&lt;/a&gt;, nails the mechanism: generative AI &lt;em&gt;accelerates&lt;/em&gt; the retreat.&lt;/p&gt;

&lt;p&gt;Here's the part I missed until I saw my own comment section. &lt;strong&gt;These aren't two theories. They're one loop.&lt;/strong&gt; The public web fills with frictionless synthetic text → real people retreat to private rooms → the public spaces that remain (the comment section under my post) get thinner on actual humans → which makes them even easier to fill with synthetic text. My "cozy" thread wasn't a healthy community. It was the calm surface of that loop running.&lt;/p&gt;

&lt;p&gt;And the comment section was already half-empty before the bots arrived. Publications spent the 2010s killing comments — &lt;em&gt;Popular Science&lt;/em&gt; &lt;a href="https://thehistoryoftheweb.com/what-happened-to-the-comment-section/" rel="noopener noreferrer"&gt;in 2013&lt;/a&gt;, and a &lt;a href="https://www.mdpi.com/2673-5172/2/4/34" rel="noopener noreferrer"&gt;peer-reviewed survey of why newsrooms did it&lt;/a&gt; found the conversation had already migrated to social platforms. The robots didn't kill the comment section. They moved into a house that was already mostly vacant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this actually works (and why I couldn't just tell)
&lt;/h2&gt;

&lt;p&gt;This is the part that unsettled me most, because I pride myself on spotting this stuff, and the research says I shouldn't trust that for a second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Humans can't distinguish LLM social text from human text.&lt;/strong&gt; Spitale, Biller-Andorno &amp;amp; Germani showed in &lt;em&gt;Science Advances&lt;/em&gt; (&lt;a href="https://www.science.org/doi/10.1126/sciadv.adh1850" rel="noopener noreferrer"&gt;2023&lt;/a&gt;) that people can't tell GPT tweets from human ones — and rate the AI's information as &lt;em&gt;more&lt;/em&gt; credible. Jones &amp;amp; Bergen found GPT-4 &lt;a href="https://arxiv.org/abs/2405.08007" rel="noopener noreferrer"&gt;passes a controlled Turing test&lt;/a&gt; (taken for human 54% of the time, FAccT 2025).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The persuasion is superhuman when it's personalized.&lt;/strong&gt; Salvi, Ribeiro, Gallotti &amp;amp; West, in &lt;em&gt;Nature Human Behaviour&lt;/em&gt; (&lt;a href="https://www.nature.com/articles/s41562-025-02194-6" rel="noopener noreferrer"&gt;2025&lt;/a&gt;): with a little data about who they're talking to, GPT-4 is &lt;strong&gt;81% more likely than a human&lt;/strong&gt; to win a debate. The Zurich r/changemyview field experiment reportedly found AI replies 3–6× more persuasive than humans — though I'll flag honestly that that study was &lt;strong&gt;withdrawn and never peer-reviewed&lt;/strong&gt;; the only on-record account is the university's &lt;a href="https://retractionwatch.com/2025/04/29/ethics-committee-ai-llm-reddit-changemyview-university-zurich/" rel="noopener noreferrer"&gt;ethics response&lt;/a&gt;. Cite it as a withdrawn preprint, not a result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fake-but-substantive content is, by now, undetectable to people.&lt;/strong&gt; This is the literature closest to my eco-comments. The canonical &lt;a href="https://aclanthology.org/P11-1032/" rel="noopener noreferrer"&gt;Ott et al. (ACL 2011)&lt;/a&gt; already showed humans judge fake reviews at chance. The LLM-era update — Meng et al., &lt;a href="https://arxiv.org/abs/2506.13313" rel="noopener noreferrer"&gt;"Fake Product Reviews are Indistinguishable to Humans and Machines"&lt;/a&gt; (2025) — found people at &lt;strong&gt;50.8%&lt;/strong&gt; (a coin flip) and detectors no better. A promotional plug wearing a sincere technical comment is exactly that, in a new venue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And the detectors fail precisely because of the specifics.&lt;/strong&gt; My detector waved these comments through, and that's not a bug in my code — it's the field. Krishna et al. (&lt;em&gt;NeurIPS 2023&lt;/em&gt;) showed &lt;a href="https://arxiv.org/abs/2303.13408" rel="noopener noreferrer"&gt;light paraphrasing collapses DetectGPT from 70.3% to 4.6%&lt;/a&gt; and defeats GPTZero, OpenAI's classifier, and watermarks. Liang et al. (&lt;em&gt;Patterns 2023&lt;/em&gt;) showed detectors are &lt;a href="https://arxiv.org/abs/2304.02819" rel="noopener noreferrer"&gt;biased against non-native English writers&lt;/a&gt; and bypassable by prompting. The "real technical detail" that made these comments feel human is the &lt;em&gt;same mechanism&lt;/em&gt; that blinds the detector. Specificity isn't proof of a human. It's camouflage.&lt;/p&gt;

&lt;p&gt;So the honest position isn't "I caught the bots." It's: &lt;strong&gt;the tools that would let me be sure don't work, and the research says they can't.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  I modeled what it does to a thread
&lt;/h2&gt;

&lt;p&gt;If I can't reliably catch individual comments, I can at least ask: what does rising automation &lt;em&gt;do&lt;/em&gt; to a conversation, statistically? So I built a toy. (&lt;a href="https://github.com/P0rt/the_cozy_web/blob/main/dead_internet_sim.py" rel="noopener noreferrer"&gt;&lt;code&gt;dead_internet_sim.py&lt;/code&gt;&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;I didn't simulate language — I simulated its statistics, because my thesis is statistical. Each comment is a bag of tokens from two pools: a big, fat-tailed &lt;strong&gt;human&lt;/strong&gt; vocabulary (where the typos, the tangents, the specific war stories live) and a tiny &lt;strong&gt;cozy&lt;/strong&gt; vocabulary of phatic praise. Each comment has an &lt;em&gt;assist level&lt;/em&gt; α from 0 (I typed this, annoyed) to 1 (an agent posts for me, I never read the thread). As α rises, more tokens come from the cozy pool and the comment's stance gets pulled from "disagree" toward "agree."&lt;/p&gt;

&lt;p&gt;Then I swept a whole community's &lt;em&gt;average&lt;/em&gt; autonomy from 0 → 1 and watched the thread's "liveness" — lexical diversity, disagreement, surprise, and a composite index that dies if &lt;em&gt;any&lt;/em&gt; of those hits zero.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jg8z7x21qslkgomqj5s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jg8z7x21qslkgomqj5s.png" alt="Liveness vs autonomy" width="799" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two things fall out, and both match what I saw on my own post:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It's not linear — there's a knee around 0.65.&lt;/strong&gt; You don't need a botnet. You need the &lt;em&gt;average&lt;/em&gt; commenter to be two-thirds on the assist dial, and the thread becomes a smooth surface: polite, "engaged," contributing almost no new information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disagreement dies first&lt;/strong&gt; (the steep red line). The very first thing automation sands off is friction — the "actually, you benchmarked this wrong" energy. Which is &lt;em&gt;exactly&lt;/em&gt; why my comment section felt so nice. It didn't get kinder. It got conflict-free, and I'd been reading conflict-free as kind.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A cozy thread even, literally, uses fewer distinct words:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh0mofm8an86fk8a26m1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh0mofm8an86fk8a26m1.png" alt="Effective vocabulary collapse" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Effective vocabulary collapses from ~175 words to ~60 as autonomy maxes out. (Honest wrinkle: at &lt;em&gt;low&lt;/em&gt; autonomy it ticks up slightly — a little assistance adds a register before saturation homogenizes everything. The damage isn't assistance existing. It's assistance &lt;em&gt;dominating&lt;/em&gt;.)&lt;/p&gt;

&lt;p&gt;And here's the detector failure as a picture — it cleanly separates the &lt;em&gt;old&lt;/em&gt; caricature comments, which is useless, because the comments on my post don't look like the left pile anymore:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6s5m1ctcydktwqc8anf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6s5m1ctcydktwqc8anf.png" alt="Coziness histogram" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The line I actually care about isn't "bot vs. human"
&lt;/h2&gt;

&lt;p&gt;I kept wanting a verdict on each account. The research talked me out of it. The useful axis isn't bot-or-not — it's the &lt;strong&gt;autonomy spectrum&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I typed it → spell-check → "polish this" → "write a comment for me" → an agent posts, I never read the thread
   α=0          α≈0.2          α≈0.5             α≈0.8                        α→1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The product account is α≈1.0 — a brand broadcasting. The two-week-old persona spraying fourteen threads is close behind. But a real growth-hacker at α≈0.8 might be genuinely interested, letting a model do the writing and slip in the plug. From the &lt;em&gt;thread's&lt;/em&gt; point of view, it barely matters: either way, the high-entropy human part — the real disagreement, the idiosyncratic detail, the thing that made it a conversation — got outsourced and smoothed away. That's the loss. Not "a bot was here," but "no one staked anything specific."&lt;/p&gt;

&lt;p&gt;There's even a cheerful counter-current I want to be fair about: AI content on the web is large but &lt;a href="https://originality.ai/ai-content-in-google-search-results" rel="noopener noreferrer"&gt;not yet total&lt;/a&gt; (~17–19% of Google's top results in 2025, by an imperfect detector), some sites are &lt;a href="https://www.techdirt.com/2026/02/03/whoops-websites-realize-that-killing-their-comment-sections-was-a-mistake/" rel="noopener noreferrer"&gt;bringing comment sections &lt;em&gt;back&lt;/em&gt;&lt;/a&gt; on the back of AI moderation, and dev.to's supportive culture is a &lt;a href="https://dev.to/code-of-conduct"&gt;real, deliberate choice&lt;/a&gt;, not just an artifact of bots. Even "what % is bots" has &lt;a href="https://arxiv.org/abs/2209.10006" rel="noopener noreferrer"&gt;no agreed answer&lt;/a&gt; — it depends entirely on your detector. The sky isn't falling. It's just getting quieter in a very specific way.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'm going to do about my own blog
&lt;/h2&gt;

&lt;p&gt;Not "ban AI" — that's unenforceable (the detectors are biased and gameable) and wrong (a quick polish genuinely helps a non-native writer or a tired one). The lever isn't the &lt;em&gt;level&lt;/em&gt; of assistance. It's whether assistance &lt;strong&gt;crowds out the high-entropy channels&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;I'll reward specificity over positivity.&lt;/strong&gt; A comment that cites line 14, a version number, a counter-benchmark is worth ten that validate my framing. If a platform ranks by "nice," it is literally selecting for the cozy mean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I'll treat disagreement as a feature, not a moderation failure.&lt;/strong&gt; My simulation's clearest result is that friction dies first. A comment culture optimized purely for niceness is optimizing for deadness with extra steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I'll stop asking "was a model involved."&lt;/strong&gt; It's the wrong question, because the answer is "yes, partly, almost always now." The real question is: &lt;em&gt;did a human read the thing and stake some specificity on a real reply?&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Limitations (read this before you @ me — if you're real)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;I can't prove a single account is a bot.&lt;/strong&gt; Everything above is signals — template reuse, account age, product plugs, cross-post spray — not a confession. The honest claim is about &lt;em&gt;aggregate texture&lt;/em&gt;, not any individual.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The simulation is a toy.&lt;/strong&gt; Two token pools and a stance variable are a cartoon of language. The &lt;em&gt;shape&lt;/em&gt; of the collapse is a property of my assumptions as much as reality. It's an argument made precise, not evidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My detector is a strawman by design&lt;/strong&gt; — I show it failing on purpose. Don't deploy it; don't deploy anything like it as a gate on real people (see Liang et al. on who gets falsely flagged).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Zurich study is withdrawn&lt;/strong&gt;, and "% of the web is bots/AI" numbers are detector-dependent and shaky. I've tried to lean only on the load-bearing peer-reviewed work and flag the rest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causation is underdetermined.&lt;/strong&gt; My cozy comments might also reflect good moderation, kind norms, or survivorship (the cranks left for Reddit). AI-mediation is &lt;em&gt;a&lt;/em&gt; driver, not provably &lt;em&gt;the&lt;/em&gt; driver.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The one-line version
&lt;/h2&gt;

&lt;p&gt;My blog didn't get a nicer community. It got an assistant, learned some manners, and stopped saying anything surprising. The internet didn't die — it just outsourced the parts that used to make it a conversation, and called the result "cozy."&lt;/p&gt;

&lt;p&gt;If this post gets a comment that opens by quoting my own framing back at me, adds one tasteful piece of nuance, and mentions a product its account is named after… well. You know what I'm going to check.&lt;/p&gt;




&lt;h3&gt;
  
  
  Run it yourself
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/P0rt/the_cozy_web
&lt;span class="nb"&gt;cd &lt;/span&gt;the_cozy_web
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

python3 dead_internet_sim.py     &lt;span class="c"&gt;# liveness collapse + figures&lt;/span&gt;
python3 coziness_detector.py     &lt;span class="c"&gt;# the heuristic scorer + histogram&lt;/span&gt;
python3 analyze_devto.py         &lt;span class="c"&gt;# tear apart a real dev.to thread (defaults to my distillation post)&lt;/span&gt;
python3 sweep_devto.py           &lt;span class="c"&gt;# the cross-platform template sweep&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Every factual claim links to its source. If you only read two, read Meng et al. on &lt;a href="https://arxiv.org/abs/2506.13313" rel="noopener noreferrer"&gt;why fake reviews are now indistinguishable&lt;/a&gt; and Krishna et al. on &lt;a href="https://arxiv.org/abs/2303.13408" rel="noopener noreferrer"&gt;why the specifics defeat the detector&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>discuss</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Your AI Coding Speedup Is a Loan, Not a Gift — and the Interest Is Coming Due</title>
      <dc:creator>Sergei Parfenov</dc:creator>
      <pubDate>Wed, 03 Jun 2026 13:37:19 +0000</pubDate>
      <link>https://dev.to/p0rt/your-ai-coding-speedup-is-a-loan-not-a-gift-and-the-interest-is-coming-due-2bkd</link>
      <guid>https://dev.to/p0rt/your-ai-coding-speedup-is-a-loan-not-a-gift-and-the-interest-is-coming-due-2bkd</guid>
      <description>&lt;p&gt;There's a number going around that should bother you more than it does: for every dollar companies spend on AI coding tokens, a large chunk goes straight back into fixing the bugs that same AI produced. The speedup is real — I feel it every day, I'm not here to tell you AI coding is fake. But "faster" and "cheaper" are not the same word, and 2026 is the year the bill started arriving.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — AI doesn't give you a productivity gift, it gives you a &lt;em&gt;loan&lt;/em&gt;: speed now, paid back later in debugging, review, and rewrites. Reporting around an Entelligence AI figure puts the "interest" at roughly 44 cents of every token dollar going to fixing AI-generated bugs. The loan is still worth taking — for the right tasks. The trap is spending borrowed time like it's income.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The number
&lt;/h2&gt;

&lt;p&gt;The stat that kicked this off: a widely-shared claim from Entelligence AI, reported across tech press, that companies spend about &lt;strong&gt;44% of their tokens fixing bugs their own AI generated.&lt;/strong&gt; The fuller breakdown making the rounds is even starker — for every $1 of token spend, ~$0.44 goes to bug fixes, ~$0.27 to rewriting AI output, ~$0.11 to review and merge delays. The pitch version: spend $100k on tokens, ~$18k reaches stable production.&lt;/p&gt;

&lt;p&gt;Now — important caveat, because this is exactly the kind of number that goes viral and then turns out to be junk. Entelligence sells reliability tooling, so that figure is self-serving. Treat the precise percentage as marketing until independently replicated. But it doesn't stand alone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CodeRabbit (also self-interested, also worth salting) analyzed ~470 open-source PRs and found AI-generated code produced &lt;strong&gt;~1.7× more issues&lt;/strong&gt; than human code — and a higher share of &lt;em&gt;critical&lt;/em&gt; ones.&lt;/li&gt;
&lt;li&gt;Independent researchers at Singapore Management University concluded in April that AI-generated code can introduce &lt;strong&gt;long-term maintenance costs&lt;/strong&gt; into real projects — no tool to sell.&lt;/li&gt;
&lt;li&gt;Uber reportedly burned its entire 2026 AI budget in four months, with its COO saying the spend was getting "harder to justify" against measurable output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different sources, different incentives, same shape: &lt;strong&gt;the code ships faster, the bugs arrive later, the maintenance compounds.&lt;/strong&gt; That's not a gift. That's a loan.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "loan" is the right metaphor (and "gift" is the dangerous one)
&lt;/h2&gt;

&lt;p&gt;A gift is free. You take it, you're ahead, done. A loan gives you something valuable &lt;em&gt;now&lt;/em&gt; in exchange for an obligation &lt;em&gt;later&lt;/em&gt; — and whether it was smart depends entirely on what you did with the principal and what the interest rate turns out to be.&lt;/p&gt;

&lt;p&gt;The viral framing I keep coming back to is the maintenance argument: if you write code twice as fast but didn't also halve your maintenance cost, you haven't gained anything durable — you've traded a one-time speed boost for a permanent obligation. Velocity on the front end, debt on the back end.&lt;/p&gt;

&lt;p&gt;Here's why the gift framing is actively dangerous: &lt;strong&gt;you book the speedup immediately and visibly&lt;/strong&gt; (PR merged, feature shipped, manager happy), but &lt;strong&gt;you pay the interest later and diffusely&lt;/strong&gt; (a 2am incident, a confusing module nobody can safely change, a security review that finds the thing six months on). The benefit is loud and the cost is quiet — so teams systematically over-borrow, because the books &lt;em&gt;look&lt;/em&gt; like pure profit right up until they don't.&lt;/p&gt;

&lt;p&gt;This is the same structural failure I keep running into in production AI systems generally: the win is the part everyone measures, the cost is the part nobody instruments until it bites.&lt;/p&gt;

&lt;h2&gt;
  
  
  The productivity-perception trap
&lt;/h2&gt;

&lt;p&gt;There's a second number that pairs with the first, and it's the uncomfortable one.&lt;/p&gt;

&lt;p&gt;METR ran a study in 2025 where experienced open-source developers did real tasks with and without AI. The developers &lt;em&gt;believed&lt;/em&gt; AI sped them up by ~20%. Measured, the early result went the other way — they were slower, because the time saved typing got eaten by finding and fixing errors, steering the model, and waiting on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now — I have to be fair about this, because the METR story is more nuanced than the headlines.&lt;/strong&gt; METR's own February 2026 update walks the dramatic version back: they found heavy selection bias (when they tried to re-run it, 30–50% of devs &lt;em&gt;refused to work without AI even for a paid study&lt;/em&gt; — itself a wild finding), and their newer, larger cohort showed roughly a -4% effect with a confidence interval spanning negative to positive. So the honest read isn't "AI makes you 19% slower." It's the softer, harder-to-dismiss version: &lt;strong&gt;the perceived speedup is consistently larger than the measured one.&lt;/strong&gt; People feel 20% faster; the data says somewhere between "a little slower" and "a little faster."&lt;/p&gt;

&lt;p&gt;That gap is the whole problem. If you &lt;em&gt;feel&lt;/em&gt; twice as productive but you're roughly break-even, and meanwhile 44 cents on the dollar is leaking into rework — you will confidently make staffing, deadline, and architecture decisions based on a productivity gain that isn't there. The feeling is the interest rate you can't see.&lt;/p&gt;

&lt;h2&gt;
  
  
  So when is the loan worth taking?
&lt;/h2&gt;

&lt;p&gt;Here's where I part ways with the doomer takes, because I use these tools every day and the answer is obviously not "stop." It's "borrow deliberately." The pattern I've landed on, watching where AI pays off versus where it quietly bills me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good loans (low interest, take them all day):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throwaway and boilerplate&lt;/strong&gt; — scaffolding, config, one-off scripts, glue code. There's no maintenance tail to pay back because the code barely has a future.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code you'd have to look up anyway&lt;/strong&gt; — the API you use twice a year, the regex, the bash incantation. AI replaces the doc-diving, not the thinking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stuff you can fully verify cheaply&lt;/strong&gt; — pure functions with obvious tests, transformations where wrong is immediately visible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bad loans (the interest eats the principal):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core domain logic you'll maintain for years&lt;/strong&gt; — every line is a future obligation, and AI is happy to write code that &lt;em&gt;looks&lt;/em&gt; right and is subtly, expensively wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anything security-sensitive&lt;/strong&gt; — auth, input handling, anything touching secrets. The reported critical-bug skew is worst exactly here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code in a domain you don't understand well enough to review&lt;/strong&gt; — if you can't catch the subtle wrong, you're not reviewing, you're rubber-stamping a loan you can't read the terms of.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dividing line is brutally simple: &lt;strong&gt;how expensive is this code to be wrong, and how cheaply can I verify it's right?&lt;/strong&gt; Cheap-to-verify, low-maintenance → free money, use AI aggressively. Expensive-to-be-wrong, long-lived → that's where the 44% lives, and where "I wrote it twice as fast" is a sentence you'll regret.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one habit that changes the math
&lt;/h2&gt;

&lt;p&gt;If I had to compress it to a single practice: &lt;strong&gt;measure the interest, not just the principal.&lt;/strong&gt; Teams obsessively track AI's upside (lines generated, tickets closed, "tokenmaxxing" leaderboards — Amazon reportedly killed one internal leaderboard after people gamed it by burning tokens for the score). Almost nobody tracks the downside in the same ledger: what fraction of incidents trace to AI-written code, how much review time it consumes, how often it gets rewritten within N weeks.&lt;/p&gt;

&lt;p&gt;Until you put both columns on the same page, every AI speedup looks like pure profit — for exactly the same reason a credit card feels like free money until the statement comes. The tool isn't the problem. Mistaking the loan for income is.&lt;/p&gt;

&lt;p&gt;The speedup is real. Just don't spend it twice.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm genuinely curious where people land on this: in your experience, is AI a net productivity gain once you count the rework — or does the maintenance tail eat it? And has anyone actually put both columns in the same ledger? Would love to see real numbers in the comments, not vibes.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://techcrunch.com/2026/05/29/coders-are-refusing-to-work-without-ai-and-that-could-come-back-to-bite-them/" rel="noopener noreferrer"&gt;"Coders are refusing to work without AI — and that could come back to bite them,"&lt;/a&gt; TechCrunch (May 2026).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://thenextweb.com/news/developers-refuse-work-without-ai-coding-productivity-paradox" rel="noopener noreferrer"&gt;"Developers won't work without AI anymore. The research says it might be making them worse,"&lt;/a&gt; The Next Web (May 2026).&lt;/li&gt;
&lt;li&gt;METR, &lt;a href="https://metr.org/blog/2026-02-24-uplift-update/" rel="noopener noreferrer"&gt;"We are Changing our Developer Productivity Experiment Design"&lt;/a&gt; (Feb 2026) — the selection-bias update and revised effect size.&lt;/li&gt;
&lt;li&gt;METR, &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;"Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity"&lt;/a&gt; (Jul 2025) — the original perception-vs-measurement result.&lt;/li&gt;
&lt;li&gt;CodeRabbit AI code-quality analysis, reported via &lt;a href="https://uk.news.yahoo.com/ai-code-bug-filled-mess-150000962.html" rel="noopener noreferrer"&gt;Futurism/Yahoo&lt;/a&gt; (2026).&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I distilled a 7B vision model into a 2B one for screenshots — and the 7B teacher scored worse</title>
      <dc:creator>Sergei Parfenov</dc:creator>
      <pubDate>Tue, 02 Jun 2026 15:36:21 +0000</pubDate>
      <link>https://dev.to/p0rt/i-distilled-a-7b-vision-model-into-a-2b-one-for-screenshots-and-the-7b-teacher-scored-worse-3akh</link>
      <guid>https://dev.to/p0rt/i-distilled-a-7b-vision-model-into-a-2b-one-for-screenshots-and-the-7b-teacher-scored-worse-3akh</guid>
      <description>&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/P0rt/vlm-distill-screenshots" rel="noopener noreferrer"&gt;https://github.com/P0rt/vlm-distill-screenshots&lt;/a&gt; &lt;br&gt;
&lt;strong&gt;Model:&lt;/strong&gt; &lt;a href="https://huggingface.co/p00rt/qwen2-vl-2b-screenshots-distill" rel="noopener noreferrer"&gt;https://huggingface.co/p00rt/qwen2-vl-2b-screenshots-distill&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's a question I keep coming back to whenever someone ships a giant model: &lt;em&gt;what would I lose if I used something 3× smaller?&lt;/em&gt; Not in the abstract — for &lt;strong&gt;my&lt;/strong&gt; task, on &lt;strong&gt;my&lt;/strong&gt; hardware, with numbers I measured myself.&lt;/p&gt;

&lt;p&gt;So I ran the experiment. I took a 7B vision‑language model (VLM), used it as a teacher to teach a 2B student one narrow skill — &lt;strong&gt;describing UI screenshots&lt;/strong&gt; — and then measured exactly what the trade changed: quality, latency, throughput, memory. The whole thing runs on a single MacBook Pro (M4 Pro, 24 GB).&lt;/p&gt;

&lt;p&gt;This post is the honest write‑up: the method, the numbers, and — maybe more useful — the three or four places where reality didn't cooperate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; The distilled 2B student runs &lt;strong&gt;~2.4× faster&lt;/strong&gt;, in &lt;strong&gt;~2.4× less memory&lt;/strong&gt;, with &lt;strong&gt;3.75× fewer parameters&lt;/strong&gt; than the 7B teacher, and it clearly beats the &lt;em&gt;untrained&lt;/em&gt; 2B baseline on the task. The genuinely surprising part: on ROUGE‑L the &lt;strong&gt;2B student scored &lt;em&gt;higher&lt;/em&gt; than the 7B teacher&lt;/strong&gt; — which is a story about the metric, not the models, and turned out to be the most interesting thing I learned. (It's also the live exception to "a student is bounded by its teacher" that I argued about in the comments of &lt;a href="https://dev.to/p0rt/how-model-distillation-actually-works-and-what-the-china-distilled-our-model-headlines-really-3o0o"&gt;my last distillation post&lt;/a&gt; — on a narrow slice, the student really can pull ahead.)&lt;/p&gt;


&lt;h2&gt;
  
  
  Why distill a VLM for a &lt;em&gt;narrow&lt;/em&gt; domain?
&lt;/h2&gt;

&lt;p&gt;The obvious objection to this whole project: "Qwen2‑VL‑2B already exists and it's good — just use it."&lt;/p&gt;

&lt;p&gt;True. But "a good general small VLM" and "a small VLM that's &lt;em&gt;reliably&lt;/em&gt; good at the one thing you need" are different products. Distillation is how you turn the first into the second: you let a stronger model define the target behavior on &lt;strong&gt;your&lt;/strong&gt; data distribution, and the small model adopts it — no manual labelling on your side.&lt;/p&gt;

&lt;p&gt;And distilling a &lt;em&gt;vision‑language&lt;/em&gt; model is less‑travelled territory than the classic "distill BERT into something tiny" story. It drags in real inference engineering — 4‑bit teachers, LoRA, quantized runtimes, memory budgets — and that engineering is half of why it's worth writing up.&lt;/p&gt;

&lt;p&gt;The task I picked is deliberately narrow: &lt;strong&gt;screenshot understanding&lt;/strong&gt;. Given a UI screenshot, produce a one‑sentence summary plus a list of the key interface elements. Perception only — no clicking, no agent. (That's future work; more at the end.)&lt;/p&gt;


&lt;h2&gt;
  
  
  The setup: task, data, metrics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data — &lt;a href="https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words" rel="noopener noreferrer"&gt;Screen2Words&lt;/a&gt;&lt;/strong&gt; (&lt;code&gt;rootsautomation/RICO-Screen2Words&lt;/code&gt;, CC‑BY‑4.0): 22,417 Android UI screenshots from the RICO corpus, each with five human‑written summaries. Native splits are train / val / test = 15,743 / 2,364 / 4,310, across 28 app categories.&lt;/p&gt;

&lt;p&gt;One detail that matters more than it looks: the human captions are &lt;strong&gt;short&lt;/strong&gt; — median &lt;strong&gt;7 words&lt;/strong&gt;. Hold that thought; it comes back to bite the metrics.&lt;/p&gt;

&lt;p&gt;I picked the &lt;code&gt;rootsautomation&lt;/code&gt; mirror specifically because it's &lt;strong&gt;CC‑BY‑4.0&lt;/strong&gt; — publishable, unlike raw RICO's research‑only terms. I check the license &lt;em&gt;before&lt;/em&gt; I push weights, not after. (It's in the model card.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics.&lt;/strong&gt; ROUGE‑L and BLEU against the human references, plus an optional teacher‑as‑judge score. I implemented ROUGE‑L and BLEU from scratch (pure Python, multi‑reference, unit‑tested) so the numbers are deterministic and dependency‑free. CIDEr — the classic captioning metric — needs corpus‑level document frequencies; I left it as a follow‑up rather than pull in a heavy dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pipeline&lt;/strong&gt;, end to end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;download → build_dataset → teacher_label → train → eval → benchmark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every stage is a typed CLI step, all hyperparameters live in &lt;code&gt;configs/*.yaml&lt;/code&gt;, and the heavy steps (labelling, training) are resumable and versioned by a config hash. I built it phase by phase, one green‑CI PR at a time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Method: three signals, one MVP
&lt;/h2&gt;

&lt;p&gt;Knowledge distillation for generation has a few flavors, and I wanted the harness to support all of them behind flags so I could ablate cleanly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Response‑based KD&lt;/strong&gt; — the teacher generates answers, the student learns to reproduce them. The full objective mixes a soft and a hard target:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   L = α · CE(student, hard_labels)
       + (1 − α) · T² · KL( softmax(teacher/T) ‖ log_softmax(student/T) )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Feature‑based&lt;/strong&gt; — align the student's vision features with the teacher's.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self‑distillation&lt;/strong&gt; — let the teacher label extra screenshots to grow the data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The MVP I actually trained is the &lt;strong&gt;hard‑target half of (1): sequence‑level distillation.&lt;/strong&gt; The teacher writes a description; the student is fine‑tuned (LoRA) to reproduce that &lt;em&gt;text&lt;/em&gt;. No teacher logits needed at train time, which keeps the whole thing laptop‑friendly.&lt;/p&gt;

&lt;p&gt;The soft‑KL term is implemented and unit‑tested (&lt;code&gt;response_kd_loss&lt;/code&gt;), but wiring it into training needs cached teacher logits — and that's exactly the α/temperature ablation axis I &lt;em&gt;couldn't&lt;/em&gt; run yet. I'd rather say that out loud than fake it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Teacher labelling
&lt;/h2&gt;

&lt;p&gt;The teacher is &lt;strong&gt;Qwen2‑VL‑7B‑Instruct in 4‑bit&lt;/strong&gt;, running through MLX on Apple Silicon. (&lt;code&gt;bitsandbytes&lt;/code&gt; is CUDA‑only, so the usual 4‑bit path doesn't exist on a Mac — MLX is the way in.)&lt;/p&gt;

&lt;p&gt;It labelled 200 training screenshots at &lt;strong&gt;~10.2 s/screenshot&lt;/strong&gt; (≈34 minutes; ≈2.7 hours projected for the full 15.7k split). Zero outputs were flagged degenerate by a light post‑validation pass (whitespace normalization + empty/too‑short detection — cheap insurance against format drift), and the mean target length was 33.6 words. A real example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The UI screenshot shows a fitness app displaying an exercise called 'Lunges,' with a progress indicator showing 30% complete. Key interface elements include a progress bar, a figure performing the exercise, and the text 'Lunges.'"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Notice it's &lt;strong&gt;33 words&lt;/strong&gt; and genuinely rich. The human reference for screens like this is more like &lt;em&gt;"exercise screen"&lt;/em&gt;. That gap is the whole story of the metrics section below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Training the student (and the part where MLX said no)
&lt;/h2&gt;

&lt;p&gt;Plan A was to train the LoRA adapter with MLX too — same runtime as the teacher, fast on Apple Silicon. Plan A died in the backward pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: [Primitive::vjp] Not implemented for CustomKernel.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;mlx‑vlm 0.6.0 can't backprop through one of Qwen2‑VL's custom Metal kernels. I checked the usual suspects — &lt;code&gt;scaled_dot_product_attention&lt;/code&gt;, RMSNorm, RoPE all &lt;em&gt;do&lt;/em&gt; have gradients — so it's a specific kernel, and both &lt;code&gt;mlx&lt;/code&gt; and &lt;code&gt;mlx-vlm&lt;/code&gt; were already on their latest release, so there was no version to bump to. MLX stays a great &lt;strong&gt;inference&lt;/strong&gt; backend here; it just can't train this model yet.&lt;/p&gt;

&lt;p&gt;Plan B: train on the &lt;strong&gt;&lt;code&gt;hf&lt;/code&gt; path (transformers + PEFT LoRA) on Apple MPS&lt;/strong&gt;, with &lt;code&gt;PYTORCH_ENABLE_MPS_FALLBACK=1&lt;/code&gt;. That worked. Two more small potholes on the way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The screenshots are too tall.&lt;/strong&gt; Qwen2‑VL expands a big RICO screenshot into &lt;em&gt;thousands&lt;/em&gt; of vision tokens; with a 1k context that overflows and throws a broadcast‑shape error deep in &lt;code&gt;get_rope_index&lt;/code&gt;. Fix: cap the visual‑token budget (&lt;code&gt;max_pixels&lt;/code&gt;) so an image stays well under the context window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A papercut:&lt;/strong&gt; recent &lt;code&gt;transformers&lt;/code&gt; pulls in a Qwen2‑VL &lt;em&gt;video&lt;/em&gt; processor that needs &lt;code&gt;torchvision&lt;/code&gt; — which I hadn't installed. Easy to miss until the first run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After that it trained cleanly: loss &lt;strong&gt;0.80 → 0.39&lt;/strong&gt; over 40 steps, and — the part that matters for "is the checkpoint real" — I reloaded the merged adapter and it generated in the trained format ("…Key interface elements include…"). On a laptop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results: the honest version
&lt;/h2&gt;

&lt;p&gt;First, the caveat that frames everything below, because it's load‑bearing: &lt;strong&gt;this is a deliberately small proof‑of‑concept.&lt;/strong&gt; The quality numbers come from short training runs and a tiny eval set (80 train / 16–100 test examples depending on the run). Treat them as &lt;em&gt;trends and a working method&lt;/em&gt;, not a benchmark. What I'm confident in is the harness and the measurement; the absolute numbers want a full‑scale run before anyone quotes them. With that said —&lt;/p&gt;

&lt;p&gt;Here's the quality table on the test split (ROUGE‑L / BLEU vs the human references):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;model&lt;/th&gt;
&lt;th&gt;ROUGE‑L&lt;/th&gt;
&lt;th&gt;BLEU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;teacher (7B)&lt;/td&gt;
&lt;td&gt;0.164&lt;/td&gt;
&lt;td&gt;0.000 †&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;student (2B + LoRA)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.178&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.019&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;baseline (2B, untrained)&lt;/td&gt;
&lt;td&gt;0.153&lt;/td&gt;
&lt;td&gt;0.018&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;† &lt;em&gt;Teacher BLEU rounds to 0.000 — that's not a bug, it's the length mismatch explained right below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Read that twice, because it surprised me too: &lt;strong&gt;the 7B teacher scores &lt;em&gt;lower&lt;/em&gt; on ROUGE‑L than the 2B student, and its BLEU is essentially zero.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's not the teacher being bad — it's the metric. The teacher writes 33‑word descriptions; the human references are 7 words. BLEU rewards exact n‑gram overlap, so a rich, correct, &lt;em&gt;long&lt;/em&gt; answer against a terse reference scores ~0. ROUGE‑L (longest common subsequence) is kinder but still favors brevity‑matching. So against short references, all three models cluster in a narrow band and the verbose teacher actually looks &lt;em&gt;worse&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The honest takeaways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Distillation helped:&lt;/strong&gt; the student (trained on teacher outputs) beats the untrained baseline, +16% relative ROUGE‑L. That's the comparison that's actually apples‑to‑apples (same model, same speed).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;These metrics undersell rich outputs.&lt;/strong&gt; This is exactly why LLM‑as‑judge and CIDEr exist, and why I flag the ROUGE‑L/BLEU numbers as a floor, not a verdict.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The clean, unambiguous win is efficiency&lt;/strong&gt; — so let's go there.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The trade‑off (the actual point)
&lt;/h2&gt;

&lt;p&gt;Same hardware, same 4‑bit setup, 128‑token generations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;model&lt;/th&gt;
&lt;th&gt;params (B)&lt;/th&gt;
&lt;th&gt;latency p50 (ms)&lt;/th&gt;
&lt;th&gt;throughput (img/s)&lt;/th&gt;
&lt;th&gt;peak mem (GB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;teacher (Qwen2‑VL‑7B)&lt;/td&gt;
&lt;td&gt;8.29&lt;/td&gt;
&lt;td&gt;1538&lt;/td&gt;
&lt;td&gt;0.63&lt;/td&gt;
&lt;td&gt;5.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;student (Qwen2‑VL‑2B)&lt;/td&gt;
&lt;td&gt;2.21&lt;/td&gt;
&lt;td&gt;651&lt;/td&gt;
&lt;td&gt;1.52&lt;/td&gt;
&lt;td&gt;2.4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;~2.4× faster, in ~2.4× less memory, with 3.75× fewer parameters.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrqe8bj33mfzdq3qr27y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrqe8bj33mfzdq3qr27y.png" alt="Quality vs speed" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2trucj6urbfr6780cawk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2trucj6urbfr6780cawk.png" alt="Quality vs memory" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The student sits in the friendly corner: as fast and light as the untrained baseline, but with the distilled quality bump on top. The teacher is off to the slow, heavy side — and, per the metrics caveat above, not even ahead on ROUGE‑L. The 2B model is the one I'd actually deploy for this task.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Honesty box.&lt;/strong&gt; "Peak memory" on Apple Silicon is unified‑memory allocation, not CUDA VRAM. The headline efficiency numbers are MLX/4‑bit on Apple Silicon, not a server GPU. As flagged above, the quality numbers are a small proof‑of‑concept — trends, not a benchmark result. The thing I'm confident in is the &lt;strong&gt;method and the measurement harness&lt;/strong&gt; — both reproducible from the repo.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Ablations: what actually moved quality
&lt;/h2&gt;

&lt;p&gt;I varied the two knobs my sequence‑level SFT exposes — training steps and LoRA rank — at fixed everything‑else, and re‑evaluated each:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;run&lt;/th&gt;
&lt;th&gt;LoRA r&lt;/th&gt;
&lt;th&gt;steps&lt;/th&gt;
&lt;th&gt;ROUGE‑L&lt;/th&gt;
&lt;th&gt;BLEU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.152&lt;/td&gt;
&lt;td&gt;0.017&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;0.170&lt;/td&gt;
&lt;td&gt;0.018&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;0.172&lt;/td&gt;
&lt;td&gt;0.020&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;0.171&lt;/td&gt;
&lt;td&gt;0.019&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4w9s2msldit4umpchgve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4w9s2msldit4umpchgve.png" alt="Ablation: steps vs ROUGE-L" width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;"Train at all" is the dominant lever&lt;/strong&gt; — the baseline → distilled jump is by far the biggest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More steps help marginally&lt;/strong&gt;, and the gain shows up more on BLEU (exact phrasing sharpens) than on ROUGE‑L.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LoRA rank is ~neutral&lt;/strong&gt; here — r8 ≈ r16. At this data scale, adapter &lt;em&gt;capacity&lt;/em&gt; isn't the bottleneck, so r8 is plenty.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The α/temperature/feature‑alignment ablations from the method section belong to the &lt;strong&gt;logit‑level&lt;/strong&gt; KD variant, which needs cached teacher logits I haven't produced yet. Three honest comparisons beat six fabricated ones.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inference engineering (the half that bites)
&lt;/h2&gt;

&lt;p&gt;The modelling is the easy part. The engineering is where the hours went:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two backends, one interface.&lt;/strong&gt; The teacher runs 4‑bit via &lt;code&gt;mlx-vlm&lt;/code&gt;; the student trains/infers via &lt;code&gt;transformers&lt;/code&gt; + PEFT. A small factory (&lt;code&gt;make_teacher&lt;/code&gt; / &lt;code&gt;make_student&lt;/code&gt;) hides which one you're on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLX can't train this model yet&lt;/strong&gt; (the &lt;code&gt;CustomKernel&lt;/code&gt; vjp gap above) — so training is &lt;code&gt;hf&lt;/code&gt;/MPS, inference can be either.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't mix runtimes in one process.&lt;/strong&gt; Evaluating an MLX teacher and a torch student in the &lt;em&gt;same&lt;/em&gt; Python process conflicts on Apple Silicon (&lt;code&gt;'array' object has no attribute 'device'&lt;/code&gt;). Run them as separate invocations. Found that one the fun way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual‑token budget is a real knob&lt;/strong&gt; — too many tokens per screenshot and you blow the context window; cap &lt;code&gt;max_pixels&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ONNX export&lt;/strong&gt; of a full VLM is famously finicky, so I kept torch/MLX inference as the canonical path and shipped a &lt;strong&gt;merge‑and‑save&lt;/strong&gt; export instead: fold the LoRA into the base weights and write a standalone 2B student you can load anywhere with plain &lt;code&gt;transformers&lt;/code&gt;. (ONNX stays a documented stretch goal.)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What the metrics miss (and what I'd do next)
&lt;/h2&gt;

&lt;p&gt;The most useful thing this project taught me wasn't a number — it was &lt;em&gt;which numbers to distrust&lt;/em&gt;. ROUGE‑L and BLEU against terse human references genuinely undersell a model that writes richer, correct descriptions. If I were taking this past proof‑of‑concept, the very next step would be &lt;strong&gt;LLM‑as‑judge scoring&lt;/strong&gt; (the harness already supports it) and &lt;strong&gt;CIDEr&lt;/strong&gt;, both of which reward content over brevity‑matching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honest limitations:&lt;/strong&gt; small scale (short training, tiny eval N); narrow domain (RICO Android UI); BLEU is low for &lt;em&gt;everyone&lt;/em&gt; because of the length mismatch; and the headline efficiency numbers are MLX/4‑bit on Apple Silicon, not a server GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future work:&lt;/strong&gt; cache teacher logits and turn on the soft‑KL term (and finally run the α/T ablation); add feature alignment; grow the data with teacher‑labelled RICO; a full‑scale run on a 24 GB GPU; and the natural next domain step — &lt;strong&gt;grounding&lt;/strong&gt; (bounding boxes) and an &lt;strong&gt;agent wrapper&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reproduce it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/P0rt/vlm-distill-screenshots &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;vlm-distill-screenshots
uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--extra&lt;/span&gt; data            &lt;span class="c"&gt;# data stack&lt;/span&gt;
uv run vlm-build-dataset        &lt;span class="c"&gt;# Screen2Words → unified {image, prompt, target}&lt;/span&gt;

uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--extra&lt;/span&gt; mlx             &lt;span class="c"&gt;# Apple Silicon teacher&lt;/span&gt;
uv run vlm-teacher-label &lt;span class="nt"&gt;--limit&lt;/span&gt; 200

uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--extra&lt;/span&gt; ml              &lt;span class="c"&gt;# transformers + peft&lt;/span&gt;
&lt;span class="nv"&gt;PYTORCH_ENABLE_MPS_FALLBACK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 uv run vlm-train &lt;span class="nt"&gt;--limit&lt;/span&gt; 200
uv run vlm-eval &lt;span class="nt"&gt;--models&lt;/span&gt; student,baseline &lt;span class="nt"&gt;--adapter&lt;/span&gt; results/checkpoints/&amp;lt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;--limit&lt;/span&gt; 100
uv run vlm-benchmark &lt;span class="nt"&gt;--models&lt;/span&gt; teacher,student
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything — configs, the metric implementations, the plots, this article — is in the repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/P0rt/vlm-distill-screenshots" rel="noopener noreferrer"&gt;https://github.com/P0rt/vlm-distill-screenshots&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model + card:&lt;/strong&gt; &lt;a href="https://huggingface.co/p00rt/qwen2-vl-2b-screenshots-distill" rel="noopener noreferrer"&gt;https://huggingface.co/p00rt/qwen2-vl-2b-screenshots-distill&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've done VLM distillation and have a take on metrics that actually reward rich descriptions, I'd love to hear it in the comments.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>llm</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Your AI Agent Isn't Failing Because It Hallucinates — It's Failing Because of Rate Limits</title>
      <dc:creator>Sergei Parfenov</dc:creator>
      <pubDate>Tue, 02 Jun 2026 13:09:00 +0000</pubDate>
      <link>https://dev.to/p0rt/your-ai-agent-isnt-failing-because-it-hallucinates-its-failing-because-of-rate-limits-2d60</link>
      <guid>https://dev.to/p0rt/your-ai-agent-isnt-failing-because-it-hallucinates-its-failing-because-of-rate-limits-2d60</guid>
      <description>&lt;p&gt;When my agents started failing in production, I did what everyone does first: I went hunting for hallucinations. Better prompts, tighter output schemas, more guardrails. None of it moved the needle, because I was debugging the wrong layer. The agent's reasoning was fine. It was the &lt;em&gt;plumbing&lt;/em&gt; that kept collapsing — and the single biggest culprit was the most boring thing imaginable: rate limits.&lt;/p&gt;

&lt;p&gt;This turns out not to be just my problem. It's the dominant production failure mode for LLM applications right now, and almost nobody talks about it because it doesn't make for a good demo.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — In production, the thing that takes your agent down usually isn't bad reasoning — it's capacity. Provider rate limits are now one of the largest sources of LLM call errors in real traces. A demo makes one request at a time; a production agent fans out into dozens of chained, retrying, concurrent calls and slams into limits the demo never touched. The fix isn't a smarter model, it's capacity engineering: budgeting, backpressure, retries with jitter, fallback models, and caching.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The data nobody puts in the pitch deck
&lt;/h2&gt;

&lt;p&gt;Here's the number that reframed how I think about agent reliability. In Datadog's analysis of real LLM observability traces, rate-limit errors were a &lt;em&gt;huge&lt;/em&gt; share of all LLM call failures — in March 2026, roughly a third of all LLM span errors were rate limits, on the order of millions of individual errors. Their conclusion was blunt: when the dominant failure mode of your LLM application is capacity, you need to redouble your &lt;em&gt;capacity engineering&lt;/em&gt;, not your prompt engineering.&lt;/p&gt;

&lt;p&gt;Sit with that. The failure mode isn't the model being dumb. It's the model provider saying "too many requests" — and your agent having no plan for that answer.&lt;/p&gt;

&lt;p&gt;It maps almost perfectly onto the broader "agents fail in production" story everyone's writing about. The reason demos lie isn't malice; it's structural. A demo runs one clean request, one user, one happy path. Production is concurrency, retries, fan-out, and load — the exact conditions that manufacture rate-limit errors. The gap between "works in a notebook" and "works at 3am under load" is, more often than people admit, a capacity gap wearing a reliability costume.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why agents hit this wall harder than chatbots
&lt;/h2&gt;

&lt;p&gt;A plain chatbot makes one API call per user turn. An agent is a different beast. A single "task" expands into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A planning call.&lt;/li&gt;
&lt;li&gt;N tool-selection calls as it loops.&lt;/li&gt;
&lt;li&gt;A call per tool result to decide the next step.&lt;/li&gt;
&lt;li&gt;Retries on each of those when something is flaky.&lt;/li&gt;
&lt;li&gt;Often a sub-agent or two, each with its own loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So one user action becomes 10–40 model calls, frequently &lt;em&gt;concurrent&lt;/em&gt;, frequently &lt;em&gt;retrying&lt;/em&gt;. The multiplier is the whole point of agents — and it's also exactly what walks you into a rate limit. Worse, the naive failure response makes it catastrophic: a call gets a 429, the framework retries immediately, that retry also gets a 429, and now you've turned one rate-limit error into a retry storm that takes the whole task down.&lt;/p&gt;

&lt;p&gt;The arithmetic is unforgiving once you write it out. Say your provider gives you 500 requests/minute. If each agent task fans out to ~20 model calls, then just &lt;strong&gt;25 concurrent tasks&lt;/strong&gt; saturate your entire quota — and that's before a single retry. Add naive immediate retries on the resulting 429s and you don't degrade gracefully, you spike straight through the ceiling. I've watched this pattern play out more than once, and every time the first instinct in the room is "the model is broken" — when the model never even ran.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcw0usb2uaz7iit0canq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcw0usb2uaz7iit0canq.png" alt="One user action fans out into 10–40 concurrent model calls that all draw from one fixed provider quota; naive retries turn a single 429 into a storm, while a limiter with backoff keeps calls under the ceiling." width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is also where serverless bites you specifically. On Cloud Run, a traffic spike spins up new instances happily — compute scales fine. But your LLM provider quota does &lt;em&gt;not&lt;/em&gt; scale with your container count. So autoscaling does the worst possible thing: it lets more concurrent agents launch, each firing its call fan-out, all drawing from the same fixed provider quota, all hitting the ceiling at once. The platform that's supposed to absorb load becomes the thing that amplifies it into the rate limiter. It's a genuinely counterintuitive failure: the healthier your autoscaling looks on the compute dashboard, the harder you're hammering a quota that can't scale with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The capacity-engineering toolkit
&lt;/h2&gt;

&lt;p&gt;None of the fixes are exotic. They're the same patterns distributed-systems people have used for decades — they just haven't migrated into most agent codebases yet, because the field grew up on prompt-craft, not ops. Here's what actually moved my reliability numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Budget and backpressure, don't just retry
&lt;/h3&gt;

&lt;p&gt;The instinct is to retry harder. The fix is to &lt;em&gt;send less&lt;/em&gt;. Put a concurrency limiter (a semaphore / token bucket) in front of all outbound model calls so your app never exceeds your known provider quota in the first place. When the budget is full, queue — don't fire-and-retry. This single change does more than any retry tuning, because it prevents the storm instead of recovering from it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="c1"&gt;# Cap concurrent in-flight calls below your provider's actual limit.
# Leave headroom — you are NOT the only caller against this quota.
&lt;/span&gt;&lt;span class="n"&gt;sem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;sem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Retry with exponential backoff &lt;em&gt;and jitter&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;When you do retry, never retry immediately, and never retry in lockstep. Synchronized retries from many workers create a thundering herd that re-triggers the limit. Exponential backoff with random jitter spreads them out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;with_backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;RateLimitError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="c1"&gt;# exponential + full jitter
&lt;/span&gt;            &lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Respect the &lt;code&gt;Retry-After&lt;/code&gt; header if the provider sends one — it's telling you exactly how long to wait, which beats guessing.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Fallback model, not just failure
&lt;/h3&gt;

&lt;p&gt;Tie this back to distillation thinking: you don't need your frontier model for every call. Route to a cheaper/secondary model (a different provider, or a smaller model on a separate quota) when the primary is rate-limited. A degraded answer beats a dead task, and you've spread load across two quota pools instead of hammering one. This is the same hybrid pattern as keeping a cheap student model for the easy 90% and falling back to an expensive teacher — just applied to &lt;em&gt;availability&lt;/em&gt; instead of &lt;em&gt;capability&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cache aggressively
&lt;/h3&gt;

&lt;p&gt;A surprising fraction of agent calls are near-duplicate: the same tool descriptions, the same system context, the same sub-queries across runs. Prompt/response caching and reusing provider-side prompt caching cuts the call volume that reaches the limiter at all. The cheapest rate-limit error is the request you never sent.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Make capacity observable
&lt;/h3&gt;

&lt;p&gt;You can't engineer what you can't see. The reason rate limits blindside teams is that they show up as generic "agent failed" errors, not as a labeled capacity problem. Log the error &lt;em&gt;class&lt;/em&gt; (429 vs timeout vs tool error), track your in-flight concurrency and your 429-rate as first-class metrics, and alert on them. The shift that mattered most for me was simply separating "the model was wrong" from "the provider said no" in the telemetry — until you do that, every failure looks like a reasoning bug, and you keep fixing the wrong layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mental model shift
&lt;/h2&gt;

&lt;p&gt;The thing I'd tell my past self: &lt;strong&gt;treat your LLM provider quota as a shared, finite, non-scaling resource — like a database connection pool, not like CPU.&lt;/strong&gt; Compute scales elastically. Your token-per-minute and request-per-minute quotas do not. Once you internalize that, agent reliability stops looking like an AI problem and starts looking like a classic distributed-systems capacity problem — which is great news, because we already know how to solve those.&lt;/p&gt;

&lt;p&gt;Smarter models won't save you here. A GPT-6 that reasons perfectly still returns 429 when you exceed your quota. The reliability frontier for agents in 2026 isn't intelligence — it's capacity engineering.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're running agents in production, I'm curious what your dominant failure mode actually is when you separate the error classes — reasoning, capacity, or tool integration? My money's increasingly on capacity. Tell me I'm wrong in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Sources &amp;amp; further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Datadog, &lt;a href="https://www.datadoghq.com/state-of-ai-engineering/" rel="noopener noreferrer"&gt;"State of AI Engineering"&lt;/a&gt; (2026) — rate-limit errors as a dominant share of LLM call failures in production traces.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.c-sharpcorner.com/article/why-ai-agents-fail-in-production-and-how-engineering-teams-are-fixing-it/" rel="noopener noreferrer"&gt;"Why AI Agents Fail in Production and How Engineering Teams Are Fixing It"&lt;/a&gt;, C# Corner (2026).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/issa_gueye/the-ai-agent-reliability-gap-in-2026-why-the-tooling-is-finally-catching-up-ne3"&gt;"The AI Agent Reliability Gap in 2026"&lt;/a&gt;, DEV Community.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.digitalapplied.com/blog/88-percent-ai-agents-never-reach-production-failure-framework" rel="noopener noreferrer"&gt;"Why 88% of AI Agents Never Reach Production"&lt;/a&gt;, Digital Applied (2026).&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How Model Distillation Actually Works (and What the 'China Distilled Our Model' Headlines Really Mean)</title>
      <dc:creator>Sergei Parfenov</dc:creator>
      <pubDate>Fri, 29 May 2026 12:11:12 +0000</pubDate>
      <link>https://dev.to/p0rt/how-model-distillation-actually-works-and-what-the-china-distilled-our-model-headlines-really-3o0o</link>
      <guid>https://dev.to/p0rt/how-model-distillation-actually-works-and-what-the-china-distilled-our-model-headlines-really-3o0o</guid>
      <description>&lt;p&gt;Every few weeks a headline drops: &lt;em&gt;"Chinese lab distilled a frontier model from OpenAI / Anthropic."&lt;/em&gt; Cue the comments — half the thread thinks distillation is a synonym for theft, the other half thinks it's some exotic Chinese trick.&lt;/p&gt;

&lt;p&gt;Both are wrong. Distillation is one of the most boring, well-established techniques in deep learning, and the labs raising the alarms use it on their own models constantly. The actual controversy is narrower and more interesting than the headlines. Let's separate the engineering from the geopolitics.&lt;/p&gt;

&lt;h2&gt;
  
  
  What distillation actually is
&lt;/h2&gt;

&lt;p&gt;Knowledge distillation trains a small &lt;strong&gt;student&lt;/strong&gt; model to imitate a large &lt;strong&gt;teacher&lt;/strong&gt; model. The classic framing comes from Hinton et al. (2015): instead of training the student only on ground-truth labels, you also train it to match the teacher's output distribution.&lt;/p&gt;

&lt;p&gt;Why does that help? Because the teacher's &lt;em&gt;full probability distribution&lt;/em&gt; carries far more information than the single correct answer. If a teacher classifies an image of a dog, it might output &lt;code&gt;dog: 0.9, wolf: 0.08, cat: 0.001&lt;/code&gt;. That "dog and wolf are similar, cat is not" signal — Hinton called it &lt;strong&gt;dark knowledge&lt;/strong&gt; — is exactly what a small model struggles to learn from hard labels alone.&lt;/p&gt;

&lt;p&gt;There are two kinds of training signal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hard labels&lt;/strong&gt; — the final answer (the token the teacher actually produced, or the ground-truth label).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Soft labels&lt;/strong&gt; — the teacher's full probability distribution over outputs, usually its logits passed through a softmax.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trick is &lt;strong&gt;temperature&lt;/strong&gt;. You divide the logits by a temperature &lt;code&gt;T &amp;gt; 1&lt;/code&gt; before the softmax, which flattens the distribution and exposes those small-but-meaningful probabilities the student should learn from.&lt;/p&gt;

&lt;p&gt;The loss is a blend of two terms: a standard cross-entropy against the real labels, and a KL-divergence pulling the student's softened distribution toward the teacher's.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;distillation_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;teacher_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Standard loss: student vs ground truth (hard labels)
&lt;/span&gt;    &lt;span class="n"&gt;hard_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Distillation loss: student vs teacher's softened distribution (soft labels)
&lt;/span&gt;    &lt;span class="n"&gt;soft_targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;teacher_logits&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;student_log_probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student_logits&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# T**2 keeps gradient magnitudes balanced when T &amp;gt; 1
&lt;/span&gt;    &lt;span class="n"&gt;soft_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kl_div&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student_log_probs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;soft_targets&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reduction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batchmean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;hard_loss&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;soft_loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For LLMs the same idea applies per token: the teacher's next-token distribution is the soft target. In practice teams mix hard and soft labels — recent work argues the gain from mixing comes less from "matching the teacher better" and more from reducing &lt;em&gt;exposure bias&lt;/em&gt; (the train/inference distribution mismatch). The point: this is normal, published, peer-reviewed engineering.&lt;/p&gt;

&lt;p&gt;And labs distill their own models all the time. The cheap, fast variant of a flagship model that you actually get to call in production? Very often a distilled student. Anthropic itself, in the middle of its own complaint about Chinese firms, acknowledged that AI companies &lt;em&gt;routinely&lt;/em&gt; distill their own models to make smaller, cheaper versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why distilling from a closed API is a different beast
&lt;/h2&gt;

&lt;p&gt;Here's the part the headlines skip. Everything above assumes you have the teacher's &lt;strong&gt;logits&lt;/strong&gt; — the raw output distribution. That's &lt;strong&gt;white-box distillation&lt;/strong&gt;, and it requires access to the model's internals or at least its full probability outputs.&lt;/p&gt;

&lt;p&gt;You do &lt;strong&gt;not&lt;/strong&gt; get logits from a closed commercial API like Claude or GPT. You get text. That forces &lt;strong&gt;black-box&lt;/strong&gt; (a.k.a. sequence-level) distillation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prompt the teacher with lots of inputs.&lt;/li&gt;
&lt;li&gt;Collect its generated text outputs.&lt;/li&gt;
&lt;li&gt;Build a synthetic dataset of (prompt → teacher answer) pairs.&lt;/li&gt;
&lt;li&gt;Fine-tune your student on that dataset with supervised fine-tuning, often followed by RL.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You lose the dark knowledge in the soft labels, but it turns out you can get remarkably far just by training on a large, high-quality synthetic dataset generated by a strong teacher. This is exactly why "did model X learn from model Y's outputs?" is such a live and hard-to-prove question — the evidence isn't a stolen weights file, it's statistical fingerprints in behavior (a model that randomly claims to &lt;em&gt;be&lt;/em&gt; ChatGPT, mirrors another model's quirks, etc.).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;White-box&lt;/th&gt;
&lt;th&gt;Black-box (closed API)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Needs&lt;/td&gt;
&lt;td&gt;Logits / weights&lt;/td&gt;
&lt;td&gt;Just text outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Signal richness&lt;/td&gt;
&lt;td&gt;High (full distribution)&lt;/td&gt;
&lt;td&gt;Lower (final answers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feasible against a closed model?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What the China allegations are about&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;This one&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  So what are the actual allegations?
&lt;/h2&gt;

&lt;p&gt;Strip the drama and here's the documented timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jan 2025&lt;/strong&gt; — After DeepSeek's R1 launch, OpenAI and Microsoft open an investigation into whether DeepSeek used ChatGPT outputs to train it. Users noticed R1 behaving suspiciously ChatGPT-like.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feb 2026&lt;/strong&gt; — OpenAI sends a memo to the U.S. House Select Committee on China alleging DeepSeek used obfuscated third-party routers to access OpenAI models and programmatically extract outputs for distillation, in violation of its terms of service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feb 24, 2026&lt;/strong&gt; — Anthropic publicly accuses three Chinese firms — &lt;strong&gt;DeepSeek, Moonshot AI, and MiniMax&lt;/strong&gt; — of coordinated "distillation attack" campaigns: flooding Claude with crafted prompts, allegedly via commercial proxy services running tens of thousands of accounts to sidestep Anthropic's China access restrictions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two things matter here, and most coverage gets them backwards:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;These are allegations.&lt;/strong&gt; The labs have not, as of writing, published the full underlying evidence, and the accused firms dispute or haven't confirmed them. Behavioral similarity is suggestive, not proof.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The dispute is not "distillation = bad."&lt;/strong&gt; As one ethics researcher put it after Anthropic's statement, if Anthropic itself calls distillation legitimate and widespread, the controversy can't be the technique. It's two narrower things: &lt;strong&gt;unauthorized access&lt;/strong&gt; (using proxies to evade geographic and account restrictions) and &lt;strong&gt;terms-of-service violations&lt;/strong&gt; (most frontier APIs explicitly forbid using outputs to train a competing model). It's closer to a contract-and-access fight than an IP-theft slam dunk — and the legal status of "training on another model's outputs" is genuinely unsettled.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  "How long does it take / how much does it cost?"
&lt;/h2&gt;

&lt;p&gt;This is the question everyone asks, and the honest answer is: dramatically less than training from scratch — which is the entire economic motive — but &lt;strong&gt;precise figures for any specific alleged case are not public.&lt;/strong&gt; Anyone quoting you an exact "they did it in N days for $M" is guessing.&lt;/p&gt;

&lt;p&gt;What we can say structurally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pretraining a frontier model from scratch&lt;/strong&gt; means a massive run on tens of thousands of high-end accelerators, plus the data pipeline and research iteration behind it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distillation collapses that timeline.&lt;/strong&gt; The expensive part — discovering the capability — was already paid for by the teacher. The student's cost is roughly: generating a synthetic dataset (API calls + time) plus a comparatively cheap fine-tuning run. That's the asymmetry the U.S. labs are upset about: they spend billions to push the frontier, and a "free-rider" can chase it for a fraction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This is also why DeepSeek's headline numbers were so contested.&lt;/strong&gt; Its self-reported low training cost and modest hardware footprint were precisely what made rivals suspect a shortcut: it's much easier to hit those numbers if you bootstrapped from an already-trained Western teacher rather than doing all the discovery yourself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So: distillation makes a &lt;em&gt;strong-ish student&lt;/em&gt; fast and cheap. It does &lt;strong&gt;not&lt;/strong&gt; let you leapfrog &lt;em&gt;past&lt;/em&gt; the teacher — a student is generally capped by the teacher it learned from. You don't distill your way to the frontier; you distill your way to a cheap copy of someone else's.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Distillation is standard, published deep-learning practice. The labs complaining about it use it themselves.&lt;/li&gt;
&lt;li&gt;White-box distillation needs logits; closed APIs only expose text, so distilling from Claude/GPT means &lt;strong&gt;black-box&lt;/strong&gt; training on generated outputs.&lt;/li&gt;
&lt;li&gt;The OpenAI and Anthropic allegations against DeepSeek, Moonshot, and MiniMax are about &lt;strong&gt;unauthorized access and ToS violations&lt;/strong&gt;, not about distillation being inherently illegitimate — and they remain &lt;em&gt;allegations&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;The economic point is real: distillation is far cheaper than frontier pretraining, which is why it's a business and policy flashpoint. But a student is bounded by its teacher.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want the deep technical version of any of these — the math of temperature scaling, why mixing hard and soft labels beats either alone, or how behavioral fingerprinting tries to &lt;em&gt;detect&lt;/em&gt; distillation — let me know in the comments.&lt;/p&gt;




&lt;h3&gt;
  
  
  Sources &amp;amp; further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI memo to the U.S. House Select Committee on China (Feb 2026) — reporting via Reuters and Rest of World.&lt;/li&gt;
&lt;li&gt;"Anthropic joins OpenAI in flagging distillation campaigns by Chinese AI firms," CNBC, Feb 24, 2026.&lt;/li&gt;
&lt;li&gt;Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network" (2015).&lt;/li&gt;
&lt;li&gt;"Understanding LLM Distillation Techniques," MarkTechPost, 2026.&lt;/li&gt;
&lt;li&gt;"The Bridge-Garden Dilemma in LLM Distillation," arXiv:2605.26246.&lt;/li&gt;
&lt;li&gt;Winston &amp;amp; Strawn, "Is AI Distillation by DeepSeek IP Theft?" (analysis of the legal gray zone).&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>llm</category>
      <category>deeplearning</category>
    </item>
  </channel>
</rss>
