<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arthur</title>
    <description>The latest articles on DEV Community by Arthur (@arthurpro).</description>
    <link>https://dev.to/arthurpro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906866%2Fd0e24b44-8169-4789-9e67-cc5b4e067b97.png</url>
      <title>DEV Community: Arthur</title>
      <link>https://dev.to/arthurpro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arthurpro"/>
    <language>en</language>
    <item>
      <title>A voice agent is not a chatbot with a phone number</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Thu, 18 Jun 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/a-voice-agent-is-not-a-chatbot-with-a-phone-number-hih</link>
      <guid>https://dev.to/arthurpro/a-voice-agent-is-not-a-chatbot-with-a-phone-number-hih</guid>
      <description>&lt;p&gt;The cleanest illustration of why this matters comes from a small, ordinary failure on a small, ordinary outbound campaign that I've been reading about: roughly one day, a few hundred cold-call attempts, and about $100 of telephony, STT, TTS, and model spend burned by a voice agent that occasionally found itself talking to someone else's voicemail or IVR or, the most expensive case, &lt;em&gt;another voice agent&lt;/em&gt;. The exchange the operator screenshotted is the kind of thing that reads as a comedy bit until you remember it's billing the whole time:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;— Hello.&lt;br&gt;
— Hello, how can I help you?&lt;br&gt;
— I'm calling because…&lt;br&gt;
— Hello, how can I help you?&lt;br&gt;
— Sure, could you tell me…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In a chat window this would be a funny screenshot. On a phone, every second of it is metered: telephony plus STT plus TTS plus model tokens, on two endpoints, both confidently polite, neither programmed to recognise the other side as a peer and hang up. The lesson the operator pulled from this, and the one I want to walk through, is the larger one: a voice agent is not a chatbot with a phone number. It's a realtime system, and almost every "voice agent failure in production" I've now read about reduces to chat-architecture assumptions being applied to a medium that doesn't tolerate them.&lt;/p&gt;

&lt;p&gt;Let me unpack what specifically doesn't translate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency in chat and latency on a call are different objects
&lt;/h2&gt;

&lt;p&gt;In a chat the unit of cost is "time until the model starts streaming a reply." A two-second pause is fine. The user is reading the previous turn or sipping coffee or alt-tabbed away. In a phone call the unit of cost is &lt;em&gt;time of silence on an open audio channel&lt;/em&gt;, and that has a perceptual budget set by human conversational physiology, not by your latency dashboards.&lt;/p&gt;

&lt;p&gt;The specific budget is well-studied. Levinson and Torreira's 2015 paper &lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4464110/" rel="noopener noreferrer"&gt;&lt;em&gt;Timing in turn-taking and its implications for processing models of language&lt;/em&gt;&lt;/a&gt;, drawing on a corpus across ten languages, reports that the typical gap between turns in natural conversation is around &lt;strong&gt;200 milliseconds&lt;/strong&gt;, with modal values clustering in the 100–300ms range — and overlap is more common than long pauses. The authors note the cognitive trick that makes this possible: speakers begin planning their response &lt;em&gt;before&lt;/em&gt; the previous turn ends. Two hundred milliseconds is an interaction signature, not a latency target you choose.&lt;/p&gt;

&lt;p&gt;Once you exceed that, perceptual breakdowns happen on a sliding scale. The voice-AI industry — see, e.g., &lt;a href="https://www.assemblyai.com/blog/low-latency-voice-ai" rel="noopener noreferrer"&gt;AssemblyAI's "300ms rule" writeup&lt;/a&gt; — converges on a perceptual gradient: by 300–400ms the listener is starting to notice the silence; by 500ms they're starting to assume something is wrong with their own line; sub-500ms is the working threshold below which an agent feels live. Retell AI, one of the larger commercial platforms, &lt;a href="https://www.retellai.com/" rel="noopener noreferrer"&gt;claims about 600ms&lt;/a&gt; end-to-end and frames that as competitive. It is competitive, and that also tells you the ceiling: even the leading systems are sitting just above the perceptual breakdown line, not below it.&lt;/p&gt;

&lt;p&gt;Now look at what the chat-style architecture has to fit inside that budget on every turn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;streaming STT to recognise the user's speech;&lt;/li&gt;
&lt;li&gt;LLM call (with potentially several tool calls — CRM lookup, calendar check, database query);&lt;/li&gt;
&lt;li&gt;response generation;&lt;/li&gt;
&lt;li&gt;streaming TTS, with first-byte audio out the door before the rest is ready.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a chat you can spend several seconds on this and the user waits. On a call you cannot, because the other party is not waiting; they are filling the silence with "hello?" and starting to repeat themselves and asking if you're still there. The streaming transcript captures all of it. The model now has to respond to a turn that is partly the original question and partly the interruption-and-repeat, and the conversation begins to liquefy.&lt;/p&gt;
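
&lt;p&gt;As a rough way to see how quickly that budget disappears, here is a minimal Python sketch that sums sequential per-stage latencies for one turn and compares the total with the 500ms threshold mentioned above. The stage names and millisecond values are illustrative assumptions, not measurements from the operator or from any platform, and real pipelines overlap stages through streaming rather than running them strictly in sequence.&lt;/p&gt;

```python
from dataclasses import dataclass

# Illustrative per-turn latencies in milliseconds. These are assumed
# placeholder values, not measurements from the article or any platform.
ASSUMED_STAGE_MS = {
    "stt_final_transcript": 150,
    "llm_first_token": 250,
    "tool_call_crm_lookup": 300,
    "tts_first_byte": 120,
}

PERCEPTUAL_BUDGET_MS = 500  # the "feels live" threshold cited above


@dataclass
class TurnBudget:
    total_ms: int
    overshoot_ms: int  # 0 when the turn fits inside the budget
    worst_stage: str


def check_turn(stage_ms, budget_ms=PERCEPTUAL_BUDGET_MS):
    """Sum sequential stage latencies and compare against the budget."""
    total = sum(stage_ms.values())
    return TurnBudget(
        total_ms=total,
        overshoot_ms=max(0, total - budget_ms),
        worst_stage=max(stage_ms, key=stage_ms.get),
    )


report = check_turn(ASSUMED_STAGE_MS)
print(report)
```

&lt;p&gt;With these assumed numbers the tool call alone accounts for more than a third of the turn; dropping it brings the total to 520ms, which is still just over the threshold.&lt;/p&gt;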

&lt;h2&gt;
  
  
  The big-prompt problem doesn't translate
&lt;/h2&gt;

&lt;p&gt;The chat reflex when an agent isn't reasoning well is to make the prompt longer. Add more rules. Add more examples. Add more tools. A long context-rich system prompt is the standard chat-deployment pattern.&lt;/p&gt;

&lt;p&gt;In voice, this fails for a separate reason, distinct from the &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;context-rot problem&lt;/a&gt; Anthropic and the Lost-in-the-Middle line of research have written about elsewhere (although that problem is also present). The voice-specific failure is &lt;em&gt;goal drift mid-call&lt;/em&gt;. The operator I'm retelling here used a low-latency Gemini Flash–class model for one project (Google documents a separate &lt;a href="https://ai.google.dev/gemini-api/docs/models" rel="noopener noreferrer"&gt;Live Preview&lt;/a&gt; line for native realtime audio; it's not clear from the source whether the operator was on Live Preview or on the standard Flash variant adapted to a voice pipeline). What the operator observed was that the model could keep up with the latency budget but, given a long playbook stuffed into one prompt, would lose track within a few turns of &lt;em&gt;which&lt;/em&gt; stage of the call it was in: had it asked about budget yet, was it still confirming identity, was it allowed to close. The model wasn't slow; it was disoriented. A fast model with a long prompt is not the same as a fast, focused model.&lt;/p&gt;

&lt;p&gt;The substitution that works isn't a smarter model. It's an explicit graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calls are graphs, not soup
&lt;/h2&gt;

&lt;p&gt;A voice agent that holds up in production does not look like a single "be helpful and talk to the customer" prompt. It looks like a set of named stages with explicit transitions, each stage carrying a short instruction, restricted tool access, and explicit fallbacks. The platforms that ship voice agents (Retell's flow editor, &lt;a href="https://elevenlabs.io/conversational-ai" rel="noopener noreferrer"&gt;ElevenLabs's Conversational AI workflow editor&lt;/a&gt;) make this graph structure visible, because that's what works:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Greeting]
    │
    ▼
[Identity check] ── wrong person ──▶ [Apologise] ─▶ [End call]
    │
    ▼ identity confirmed
[Consent] ── not given ──▶ [Apologise] ─▶ [End call]
    │
    ▼ consent given
[Question 1] ─▶ [Question 2] ─▶ [Question 3]
    │
    ▼
[Closing]
    │
    ▼
[End call]

Fallbacks (any state):
  voicemail detected    ─▶ [Leave message] ─▶ [End call]
  IVR detected          ─▶ [Press digit / wait for transfer]
  technical issue       ─▶ [Apologise + "we'll call back"] ─▶ [End call]
  another bot detected  ─▶ [End call]
  budget cap reached    ─▶ [End call]   (hard limit; not a prompt instruction)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is dull engineering. It is also the engineering that turns "the agent sometimes gets confused" into "the agent's behaviour is auditable and the failure modes are named." Each stage has a budget — both in tokens and in real seconds — and each transition is explicit. &lt;em&gt;No&lt;/em&gt; stage's instruction is "use your judgement"; if a stage needs judgement, that's a sign it should be split into two stages.&lt;/p&gt;
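
&lt;p&gt;One way to make the diagram above executable is a plain transition table. The sketch below is a minimal Python illustration under stated assumptions, not how Retell or ElevenLabs implement their editors: the state and event names are hypothetical labels, and it assumes the three questions run in sequence before closing.&lt;/p&gt;

```python
# Transition table transcribed from the diagram above. State and event
# names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    ("greeting", "done"): "identity_check",
    ("identity_check", "identity_confirmed"): "consent",
    ("identity_check", "wrong_person"): "apologise",
    ("consent", "consent_given"): "question_1",
    ("consent", "not_given"): "apologise",
    ("question_1", "done"): "question_2",
    ("question_2", "done"): "question_3",
    ("question_3", "done"): "closing",
    ("closing", "done"): "end_call",
    ("apologise", "done"): "end_call",
    ("leave_message", "done"): "end_call",
    ("apologise_callback", "done"): "end_call",
}

# Fallbacks apply from any non-terminal state and win over stage transitions.
FALLBACKS = {
    "voicemail_detected": "leave_message",
    "ivr_detected": "ivr_navigation",
    "technical_issue": "apologise_callback",
    "bot_detected": "end_call",
    "budget_cap_reached": "end_call",
}


def next_state(state, event):
    """Return the next stage, or raise if the transition is not declared."""
    if state == "end_call":
        return "end_call"  # terminal: nothing reopens a finished call
    if event in FALLBACKS:
        return FALLBACKS[event]
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {event}") from None


def run(events, start="greeting"):
    """Feed a sequence of events through the graph and return the final stage."""
    state = start
    for event in events:
        state = next_state(state, event)
    return state


print(run(["done", "identity_confirmed", "bot_detected"]))
```

&lt;p&gt;The design choice worth noting is that an undeclared transition raises instead of falling through to the model, so a confused call shows up as a named error rather than as improvised behaviour.&lt;/p&gt;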

&lt;h2&gt;
  
  
  What works in voice (and what doesn't)
&lt;/h2&gt;

&lt;p&gt;The same operator's piece is candid about which categories of voice deployment they made work and which they couldn't. The patterns are clean enough to tabulate; what's interesting is &lt;em&gt;why&lt;/em&gt; the column splits look the way they do.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;What changes vs. chat&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Inbound lead qualification&lt;/strong&gt; (small fixed questionnaire)&lt;/td&gt;
&lt;td&gt;Closed-world flow; user has consented to the call by submitting the form; small graph with a clear success criterion&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Worked.&lt;/strong&gt; ~40 hours/week saved on a four-rep team.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Webinar attendance reminders&lt;/strong&gt; (call N minutes before start)&lt;/td&gt;
&lt;td&gt;Single objective, single FAQ branch ("who are you / what's the webinar about"), short call&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Worked.&lt;/strong&gt; Attendance lifted from ~10% to ~30%.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Cold outbound&lt;/strong&gt; (open-world dial)&lt;/td&gt;
&lt;td&gt;Voicemail, IVR, gatekeepers, other bots, "send us an email instead," "I don't make those decisions," "who gave you my number" — each needs explicit behaviour&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Did not work.&lt;/strong&gt; $100/day burning on indeterminate paths.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern is structural, not coincidental. Inbound and reminders have a &lt;em&gt;closed world&lt;/em&gt;: you control the flow because you also control the entry point. The user dialled or opted in; they're inside your graph from second one. Cold outbound has the opposite property: the world dials back, and the world contains things your graph does not. The right default for cold outbound is therefore not a smarter agent or a better prompt; it's a more aggressive &lt;em&gt;exit policy&lt;/em&gt; — every recognisable open-world input maps to a transition that ends the call without burning cycles.&lt;/p&gt;

&lt;p&gt;The hidden cost is that every one of those open-world inputs has to be &lt;em&gt;recognised&lt;/em&gt; before you can transition on it. Recognising "this is voicemail and not a person" is itself a hard signal-processing task, and getting it wrong on either side is expensive: false positives end calls with real prospects, false negatives leave the agent monologuing to a beep for the maximum call duration the platform allows. (And if the platform has no maximum, which is somebody's first oversight, the bill is the limit.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Why managed voice platforms are not just "Twilio with a wrapper"
&lt;/h2&gt;

&lt;p&gt;You can build all of this directly on &lt;a href="https://www.twilio.com/en-us/voice" rel="noopener noreferrer"&gt;Twilio's media streams&lt;/a&gt; and your own STT/TTS/LLM pipes. The case for using a managed voice-agent platform (Retell, ElevenLabs, or one of several others that have appeared in the last 18 months) isn't that they're hard to imitate. It's that the things they ship under the hood are exactly the things that make the difference between a demo and a production deployment, and you only realise this after you've discovered them yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interruption handling.&lt;/strong&gt; When the user talks over the agent, the TTS has to actually stop, the STT has to absorb the new turn, and the agent state has to update. "The TTS stops mid-syllable" is not a free behaviour; it's the result of a tightly integrated audio pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming STT/TTS coordination with first-byte targets.&lt;/strong&gt; Generating a full response and &lt;em&gt;then&lt;/em&gt; sending it to TTS is fatal for latency. Streaming the text as it's generated, and beginning TTS on the first sentence, is fatal &lt;em&gt;un&lt;/em&gt;tested. There is no architecture-on-paper that gets this right; it has to be tuned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression tests for prompts and tool calls.&lt;/strong&gt; When you change the wording in the consent stage, you want to know that the budget-question stage didn't silently start failing. The platforms ship saved-conversation regression tests precisely because hand-written voice tests are unreasonably hard to maintain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard limits on call duration and spend.&lt;/strong&gt; Not a prompt — &lt;em&gt;a limit&lt;/em&gt;. If the agent enters an infinite politeness loop with another bot, the call has to end because the limit said so, not because the agent reasoned its way out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-call extraction.&lt;/strong&gt; A consistent set of fields pulled from the transcript at end-of-call, rather than asked of the model live.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What the platforms are actually selling is &lt;em&gt;the boring stuff that turns out to be load-bearing&lt;/em&gt;. It is much cheaper to buy this than to discover what each piece is for and rebuild it badly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pre-launch checklist
&lt;/h2&gt;

&lt;p&gt;If I were starting on a voice agent today, these are the questions I'd want answered, in this order, before I picked a model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What are the named stages of a typical successful call?&lt;/li&gt;
&lt;li&gt;What's the &lt;em&gt;one&lt;/em&gt; objective of each stage?&lt;/li&gt;
&lt;li&gt;What inputs does each stage expect, and what data does it have access to?&lt;/li&gt;
&lt;li&gt;Which tools are valid in which stages? (Most stages should have &lt;em&gt;zero&lt;/em&gt;.)&lt;/li&gt;
&lt;li&gt;What are the legal transitions? Which transitions are explicitly forbidden?&lt;/li&gt;
&lt;li&gt;What counts as success? What counts as a dead end?&lt;/li&gt;
&lt;li&gt;When is the agent &lt;em&gt;required&lt;/em&gt; to end the call?&lt;/li&gt;
&lt;li&gt;How is voicemail recognised? IVR? Another bot?&lt;/li&gt;
&lt;li&gt;What's the latency budget for each stage, and how do we know we hit it?&lt;/li&gt;
&lt;li&gt;Which conversations do we save as regression tests?&lt;/li&gt;
&lt;li&gt;What's the per-call spend cap that auto-terminates? (This is not optional.)&lt;/li&gt;
&lt;li&gt;What's the per-day spend cap on the campaign?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last two are not jokes. The classic voice-agent incident is the cost-disaster one — not because the agent did something dramatic, but because nobody set the limit.&lt;/p&gt;
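
&lt;p&gt;A cap of this kind can live entirely outside the model. The following is a minimal Python sketch, assuming a hypothetical &lt;code&gt;SpendGuard&lt;/code&gt; class and assuming the caller can meter spend and elapsed seconds for each call; the dollar and second values in the usage lines are placeholders, not recommendations.&lt;/p&gt;

```python
class SpendGuard:
    """Hard per-call and per-day caps, enforced outside the model."""

    def __init__(self, per_call_cap_usd, per_day_cap_usd, max_call_seconds):
        self.per_call_cap_usd = per_call_cap_usd
        self.per_day_cap_usd = per_day_cap_usd
        self.max_call_seconds = max_call_seconds
        self.day_spend_usd = 0.0

    def may_start_call(self):
        # Refuse to dial once the campaign's daily budget is used up.
        return self.per_day_cap_usd > self.day_spend_usd

    def must_hang_up(self, call_spend_usd, call_seconds):
        # True as soon as any limit is reached; the agent is not consulted.
        return (
            call_spend_usd >= self.per_call_cap_usd
            or call_seconds >= self.max_call_seconds
            or self.day_spend_usd + call_spend_usd >= self.per_day_cap_usd
        )

    def record_call(self, call_spend_usd):
        # Add a finished call's cost to the running daily total.
        self.day_spend_usd += call_spend_usd


# Placeholder limits for illustration only.
guard = SpendGuard(per_call_cap_usd=0.50, per_day_cap_usd=20.0, max_call_seconds=180)
print(guard.must_hang_up(call_spend_usd=0.10, call_seconds=30))
print(guard.must_hang_up(call_spend_usd=0.10, call_seconds=180))
```

&lt;p&gt;The point of the shape is that &lt;code&gt;must_hang_up&lt;/code&gt; takes only metered numbers as input, so a politeness loop between two bots ends when the clock or the meter says so.&lt;/p&gt;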

&lt;h2&gt;
  
  
  What I'm taking from this
&lt;/h2&gt;

&lt;p&gt;The framing I keep coming back to is that voice agents are not the natural successor to chatbots. They're a different class of system that happens to share an LLM. The chat lineage tells you to hand the model a long prompt, give it broad tool access, and let it figure out the conversation; the voice medium punishes every one of those choices. The systems that work in voice tend to be small, explicit graphs with named stages, narrow tool grants per stage, hard time-and-money limits, and an aggressive exit policy when the world doesn't behave like the graph expected.&lt;/p&gt;

&lt;p&gt;The one-line summary the operator I read closes on is the one I'd keep: &lt;em&gt;making the agent call is not the hard part; making it stop calling, in the right way, at the right time, when it's clearly off the rails, is the hard part.&lt;/em&gt; That's the engineering. Everything before it is plumbing.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>aiagents</category>
      <category>llm</category>
      <category>latency</category>
    </item>
    <item>
      <title>Ninety-one percent accurate is not what it sounds like</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Thu, 18 Jun 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/ninety-one-percent-accurate-is-not-what-it-sounds-like-3ji7</link>
      <guid>https://dev.to/arthurpro/ninety-one-percent-accurate-is-not-what-it-sounds-like-3ji7</guid>
      <description>&lt;p&gt;The April 2026 &lt;em&gt;New York Times&lt;/em&gt; commission of &lt;a href="https://openai.com/index/introducing-simpleqa/" rel="noopener noreferrer"&gt;Oumi to test Google's AI Overviews against the SimpleQA benchmark&lt;/a&gt; produced two numbers that were widely reported and one that mostly was not. The widely reported numbers: 85% accuracy on Gemini 2 in the AI Overview slot, 91% on Gemini 3. &lt;em&gt;Roughly one in ten answers wrong&lt;/em&gt;, in headlines from TechSpot, Futurism, Newsweek, BigGo, TechRepublic, Breitbart, Computing.co.uk, Newsbytes, Algorythmic, and DigitalToday. The number that mostly didn't make the headlines, but should have: among the answers the benchmark scored as &lt;em&gt;correct&lt;/em&gt;, Oumi tracked how often the AI Overview's stated claim was actually supported by the source it cited, and the unsupported rate &lt;em&gt;grew&lt;/em&gt; between the model upgrades, from 37% of correct answers ungrounded on Gemini 2 to &lt;strong&gt;56% on Gemini 3&lt;/strong&gt;. The model got more accurate; its summaries got less faithful to what their citations actually said.&lt;/p&gt;

&lt;p&gt;That is the part of the story that I want to spend most of this essay on, because once you sit with it for a moment it stops looking like a quirk of one analysis and starts looking like the shape of the entire AI-search class of product. The 9% error number is interesting; the source-claim divergence is structural; and the trust-budget the interface establishes against either of them is the thing that determines whether your week of casually reading AI-summarised search results was useful or actively misleading.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ninety-one percent comes from
&lt;/h2&gt;

&lt;p&gt;The arithmetic is unkind. SimpleQA is OpenAI's &lt;a href="https://openai.com/index/introducing-simpleqa/" rel="noopener noreferrer"&gt;4,326-question benchmark&lt;/a&gt; of short fact-seeking questions, each constructed to have a single time-stable answer that two independent annotators agreed on, and each filtered through a third annotator on a thousand-question subset for additional QA. It is a &lt;em&gt;clean&lt;/em&gt; benchmark — almost cruelly so. The questions are not the kind of thing your laptop's AI search receives in a normal day. SimpleQA asks "Who was the second-place finisher in the 1992 IOC presidential election?" and your laptop is asked to compare two pairs of trail-running shoes that were released last quarter. The benchmark is not load-bearing on the realism front. It is load-bearing on the &lt;em&gt;can the model retrieve a fact that it has the data for&lt;/em&gt; front.&lt;/p&gt;

&lt;p&gt;Google's response to the analysis was that real users don't ask SimpleQA-shaped questions; their internal benchmarking, on more representative queries, produces different (better, in their telling) numbers. That's a defensible point, and at the same time the standalone Gemini 3 hallucination rate Google itself disclosed in their pushback was around 28% — measured on Google's own internal benchmark, not SimpleQA, so the two numbers don't subtract cleanly. The directional point survives: grounding is doing real work, and the 9% on SimpleQA is the residual after RAG has already suppressed a substantial fraction of standalone failure. The 9% that remains is what's left after the work is done — the residual failures that grounding cannot fix because they don't live inside the model's pretraining; they live in the seam between the model and the index it's allowed to consult.&lt;/p&gt;

&lt;p&gt;There are four obvious places to look for the seam, and the &lt;a href="https://arstechnica.com/google/2026/04/analysis-finds-google-ai-overviews-is-wrong-10-percent-of-the-time/" rel="noopener noreferrer"&gt;Oumi analysis&lt;/a&gt; and the surrounding industry literature taken together implicate all of them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure stage&lt;/th&gt;
&lt;th&gt;What goes wrong&lt;/th&gt;
&lt;th&gt;Concrete shape&lt;/th&gt;
&lt;th&gt;Caught by RAG?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Query interpretation / branching&lt;/td&gt;
&lt;td&gt;The natural-language question is parsed into the wrong sub-queries; query branching splits a unitary question into pieces that don't recombine&lt;/td&gt;
&lt;td&gt;"Did this drug interact with that one in the trial?" branches to "what did the drug do?" + "what did the other drug do?" — and never asks the interaction question&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source ranking&lt;/td&gt;
&lt;td&gt;The retriever returns ranked-relevant documents that are popular but not authoritative&lt;/td&gt;
&lt;td&gt;The Reddit comment thread outranks the manufacturer's spec sheet for a query about manufacturer specs&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fact compilation&lt;/td&gt;
&lt;td&gt;The model picks the &lt;em&gt;modal&lt;/em&gt; claim across retrieved sources rather than the &lt;em&gt;correct&lt;/em&gt; one&lt;/td&gt;
&lt;td&gt;Three-out-of-five blog posts say the protein is X; the protein is actually Y; the AI Overview answers X&lt;/td&gt;
&lt;td&gt;Partially — depends on retriever quality and reranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-processing / smoothing&lt;/td&gt;
&lt;td&gt;The fluent generator paraphrases a citation's claim into something the citation does not actually support&lt;/td&gt;
&lt;td&gt;Of the 91% of answers Gemini 3 got right on SimpleQA, 56% had a gap between the claim and the cited source — up from 37% of the 85% correct on Gemini 2&lt;/td&gt;
&lt;td&gt;No — this is the seam grounding cannot reach&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last row is where the source-claim divergence number is coming from. The model is grounded on real documents, retrieves them in a sensible-looking order, and then &lt;em&gt;rewrites&lt;/em&gt; the answer in a way that sounds authoritative and confident and doesn't faithfully match the document it cites. The 56% rate is &lt;em&gt;of the correct answers&lt;/em&gt;: among the 91 in 100 that scored as right under SimpleQA, about 51 (56% of 91) had a gap between the headline claim and the citation chain. The headline claim was right enough; the citation underneath wasn't faithful to what the source actually said. This is the load-bearing failure of the AI-search class, and it does not improve with model size. It is a &lt;em&gt;language&lt;/em&gt; failure, not a &lt;em&gt;retrieval&lt;/em&gt; failure. The fluency that makes the answer feel like a written human summary is the same fluency that smooths the citation chain into something you can no longer audit.&lt;/p&gt;
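
&lt;p&gt;To keep the two percentages from blurring together, it can help to put them on the same denominator. The sketch below is a small Python illustration that splits all answers into three buckets; it assumes, as the analysis describes, that the 37% and 56% figures are shares of the &lt;em&gt;correct&lt;/em&gt; answers, and the function and key names are hypothetical.&lt;/p&gt;

```python
def breakdown(accuracy, ungrounded_share_of_correct):
    """Split all answers into three buckets, as fractions of the total."""
    correct_unsupported = accuracy * ungrounded_share_of_correct
    correct_supported = accuracy - correct_unsupported
    wrong = 1.0 - accuracy
    return {
        "correct_and_supported": round(correct_supported, 4),
        "correct_but_unsupported": round(correct_unsupported, 4),
        "wrong": round(wrong, 4),
    }


# Figures as reported for the two model generations.
gemini_2 = breakdown(accuracy=0.85, ungrounded_share_of_correct=0.37)
gemini_3 = breakdown(accuracy=0.91, ungrounded_share_of_correct=0.56)
print(gemini_2)
print(gemini_3)
```

&lt;p&gt;Under that assumption, the share of all answers that are both correct and supported by their citation falls from roughly 54% to roughly 40% between the two models, even as headline accuracy rises.&lt;/p&gt;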

&lt;h2&gt;
  
  
  What ninety-one percent compares to
&lt;/h2&gt;

&lt;p&gt;It is worth running the comparison the source piece I'm reading suggested, because it is the most useful frame I've seen for thinking about the &lt;em&gt;trust&lt;/em&gt; part of this. Major diagnostic errors at a Swiss teaching hospital, &lt;a href="https://pubmed.ncbi.nlm.nih.gov/10885353/" rel="noopener noreferrer"&gt;comparing antemortem clinical diagnoses against autopsy findings&lt;/a&gt;, ran 30% in 1972, 18% in 1982, and 14% in 1992, a substantial improvement attributable in the authors' reading to the rise of ultrasonography and endoscopy. &lt;em&gt;Minor&lt;/em&gt; diagnostic errors, the same paper found, almost doubled over the same period: 23% in 1972 to 46% in 1992. More tools brought more granular errors alongside fewer catastrophic ones. None of this is a crisis. It is the rate at which a sophisticated profession running a busy hospital, with consulting peers and second opinions and post-hoc verification, gets things wrong.&lt;/p&gt;

&lt;p&gt;The headline number for AI Overviews, 9% on grounded SimpleQA, sits in the same numerical neighbourhood as 1990s-Swiss-clinic &lt;em&gt;major&lt;/em&gt; error rates. The two numbers aren't strictly commensurate — clinical diagnosis is multi-step reasoning across an entire patient encounter, SimpleQA is single-fact retrieval, and the scoring rubrics are very different — but the comparison is useful as a calibration of where 9% sits in the universe of human-institution error rates we already accept. It is comparable to a profession with two thousand years of practice, decade-over-decade tooling improvements, and explicit error-catching protocols. The comparison is, with that caveat, uncomfortably honest about where the technology is.&lt;/p&gt;

&lt;p&gt;The trouble is that the question of accuracy is not the only one that matters. The Swiss clinicians had three things AI search does not: peer consultation, second-opinion protocols, and a post-hoc verification step (the autopsy itself) that turned every individual error into a feedback signal for the institution. AI Overviews has none of these by construction. The user reads the summary, treats it as the answer, and moves on. There is no autopsy. The 9% errors that get through are not errors that get &lt;em&gt;caught&lt;/em&gt;; they are errors that propagate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the trust budget is wrong
&lt;/h2&gt;

&lt;p&gt;Here is where the second number, the 56% source-claim discrepancy, becomes the part of the story that should have been the headline. When a piece of software hands you an answer accompanied by a footnote-style citation marker, the user-experience signal of that interface is &lt;em&gt;this claim is verified by this source.&lt;/em&gt; You can in principle click the link, but the affordance is calibrated for the case where you don't. The interface is selling you a model of the world in which the claim and the citation are coupled tightly enough that you don't need to do the coupling yourself.&lt;/p&gt;

&lt;p&gt;The Oumi finding says that for over half of Gemini 3's correct answers, that coupling is loose. The footnote does not say what the answer says. &lt;em&gt;Most&lt;/em&gt; of the time, the looseness is the kind that doesn't change the answer's truth value. &lt;em&gt;Some&lt;/em&gt; of the time, it does, and the SimpleQA scoring has already absorbed that into the 9% figure. The remaining looseness, the gap between "the claim is right enough" and "the cited source supports the claim", is invisible from the surface.&lt;/p&gt;

&lt;p&gt;The interface is not making you a worse reasoner. It is offering you a trust gradient that is steeper than the underlying trust the system has earned. The 91% number sounds like &lt;em&gt;you can trust nine answers in ten.&lt;/em&gt; The 56% number says &lt;em&gt;of those nine, at least half have a citation chain that wouldn't survive a careful read.&lt;/em&gt; These are not contradictory. They describe two different things. The 91% is about the answer; the 56% is about whether you could reconstruct the answer's lineage if you tried.&lt;/p&gt;

&lt;p&gt;For most casual queries this difference does not matter, because the consequences of being wrong are small. For knowledge work — and the user populations that AI search has expanded into are increasingly composed of people doing knowledge work — the difference is the difference between "this is a faster way to do the same thing" and "this is a faster way to lose track of where my facts came from." The second one is the failure mode the trust gradient hides.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Swiss-clinic protocol question
&lt;/h2&gt;

&lt;p&gt;The reason the Swiss-clinic comparison is useful is that it points at the part of the problem that is solvable, even if it isn't being solved. &lt;em&gt;14% major error in clinical diagnosis is a fine number because the institution that produces it has overlapping verification protocols.&lt;/em&gt; The institution is the load-bearing thing, not the individual clinician. AI search at 9% does not have the institution. The user is the institution, and the user's verification protocol is "did the answer feel right."&lt;/p&gt;

&lt;p&gt;The engineering target this implies is not "drive the 9% down to 5%." It is &lt;em&gt;give the user back the verification protocol the interface took from them.&lt;/em&gt; Make the citation-claim coupling visible and verifiable in the UI, the same way Wikipedia's footnotes are. Surface the source-claim divergence number per answer, not per fleet. When the model isn't sure which of two retrieved sources is authoritative, &lt;em&gt;show both and ask&lt;/em&gt;, the way a clinician orders a second test rather than picking the median answer. None of this requires a better model. All of it requires a different relationship between the interface and the user, one that admits the actual numbers rather than papering over them.&lt;/p&gt;

&lt;p&gt;This is the kind of design conversation that is genuinely hard because it cuts against the entire commercial premise of the AI-search class of product. The premise is that the user gets a single, fluent, answer-shaped object. Adding verification protocols turns the answer-shaped object back into the multi-source reading task that AI search was supposed to replace. The honest version of the product, the one that admits the 56% number, is by construction a less impressive demo and a less attractive ad.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ninety-one percent actually means downstream
&lt;/h2&gt;

&lt;p&gt;The reason this is worth sitting with rather than dismissing is that 9% propagates. A user who consults AI search for fifteen factual claims in a week — on the SimpleQA-shaped subset, anyway — has, on average, inserted more than one wrong claim into their thinking, distributed in a way that doesn't correlate with the user's confidence in any individual claim. The wrong ones feel the same as the right ones. The Swiss clinicians had peer review and the autopsy; the user has the next time someone reads their work and disagrees, which is to say no protocol at all.&lt;/p&gt;
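
&lt;p&gt;The arithmetic behind "more than one wrong claim" is short enough to write out. This Python sketch assumes each lookup is wrong independently with the same 9% probability, which is a simplification introduced here rather than something the Oumi analysis claims, and the function names are hypothetical.&lt;/p&gt;

```python
def expected_wrong(n_claims, error_rate):
    """Expected number of wrong claims among n independent lookups."""
    return n_claims * error_rate


def prob_at_least_one_wrong(n_claims, error_rate):
    """Chance that at least one of n independent lookups is wrong."""
    return 1.0 - (1.0 - error_rate) ** n_claims


weekly = expected_wrong(15, 0.09)
risk = prob_at_least_one_wrong(15, 0.09)
print(round(weekly, 2), round(risk, 2))
```

&lt;p&gt;Under the independence assumption, fifteen lookups carry an expected 1.35 wrong claims and roughly a three-in-four chance of at least one.&lt;/p&gt;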

&lt;p&gt;This is not an argument against using AI search. It is an argument for understanding what we have. The most useful response, for an engineer, is to remember that 91% is a &lt;em&gt;floor&lt;/em&gt; number for the 9%-of-answers-wrong story and a &lt;em&gt;ceiling&lt;/em&gt; number for the trust the interface should be selling. The two should not converge; right now they do, and that's the part that's actively misleading rather than just imperfect. Treating AI search as a tool that gets things mostly right, and verifying the citation chain when the cost of being wrong matters, is the calibration the math actually supports.&lt;/p&gt;

</description>
      <category>aisearch</category>
      <category>llm</category>
      <category>gemini</category>
      <category>googleaioverviews</category>
    </item>
    <item>
      <title>Audit Logs Caught 14 Police Officers Stalking. They Just Got Harder to Read.</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Wed, 17 Jun 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/audit-logs-caught-14-police-officers-stalking-they-just-got-harder-to-read-9ng</link>
      <guid>https://dev.to/arthurpro/audit-logs-caught-14-police-officers-stalking-they-just-got-harder-to-read-9ng</guid>
      <description>&lt;p&gt;The &lt;a href="https://ij.org/police-have-reportedly-used-license-plate-readers-to-stalk-romantic-interests-at-least-14-times-in-recent-years/" rel="noopener noreferrer"&gt;Institute for Justice's analysis&lt;/a&gt;, published in late April and the subject of a 263-point Hacker News thread on May 1, identifies fourteen documented cases of US police officers using automated license-plate-reader networks to track romantic interests, ex-partners, or strangers they had personally fixed on. The bulk of the cases occurred since 2024. Most of the officers named in the analysis were criminally charged. Most lost their jobs by either resigning or being fired.&lt;/p&gt;

&lt;p&gt;The IJ analysis is careful about its own scope. &lt;em&gt;"The 14 cases listed below are almost certainly an undercount,"&lt;/em&gt; the report notes, listing reasons: not all police misconduct gets detected; some cases get resolved quietly; &lt;em&gt;"Officers frequently cite vague or inaccurate reasons for their searches in ALPR systems, sometimes to evade detection of misconduct."&lt;/em&gt; Most of the cases that did surface, the IJ analysis observes, surfaced &lt;em&gt;"only after victims reported the officers' behavior to the police, typically in the context of a broader stalking allegation."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What follows is what is publicly known. The list is short; the structural facts behind it are what the rest of this article is about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four names
&lt;/h2&gt;

&lt;p&gt;The IJ analysis names individual officers, departments, dates, and outcomes. The four cases below are representative of the fourteen, and each has been independently corroborated through the cited reporting.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Officer / Department&lt;/th&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Charges&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;th&gt;What was used&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Officer Michael McSherry, Westmoreland County PD (Pennsylvania)&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;Stalking&lt;/td&gt;
&lt;td&gt;Pleaded guilty&lt;/td&gt;
&lt;td&gt;License-plate-reader queries against estranged wife and other family members&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lieutenant Victor Heiar, Kechi PD (Kansas)&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Computer crime + stalking&lt;/td&gt;
&lt;td&gt;Pleaded guilty&lt;/td&gt;
&lt;td&gt;Flock cameras to track estranged wife&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Officer Robert Josett, Costa Mesa PD (California)&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Multiple criminal charges (filed April 2026)&lt;/td&gt;
&lt;td&gt;Pleaded guilty&lt;/td&gt;
&lt;td&gt;Flock camera system to track mistress and her other romantic interests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deputy Lamar Eliseo Roman, Monroe County Sheriff's Office (Florida)&lt;/td&gt;
&lt;td&gt;February 2026&lt;/td&gt;
&lt;td&gt;Alleged stalking; charges pending&lt;/td&gt;
&lt;td&gt;Under investigation&lt;/td&gt;
&lt;td&gt;ALPR hotlist after meeting target on a TV-set security detail&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The other ten cases follow similar shapes. Most involve a current or former intimate partner of the officer. A smaller number, of which the Roman case is the most recent, involve a stranger the officer had become fixated on and then used ALPR access to track. Outcomes range across criminal conviction, administrative discipline, and, on the IJ analysis's own framing, officer resignation as the resolution mechanism, against the well-documented broader pattern that resignation in policing does not always end a law-enforcement career.&lt;/p&gt;

&lt;p&gt;Three of the four cases above have reached a final disposition; the Roman case, per the table, is still under investigation. The most useful detail in the table is its last column, the one the structural argument hinges on. In every case the offending officer used a system that recorded the searches at the moment they happened. In every case the records existed before the case opened, in the older cases for years. The audit logs were written when the queries ran. They were read when, and only when, an investigation arrived to read them.&lt;/p&gt;

&lt;h2&gt;
  
  
  How these fourteen surfaced
&lt;/h2&gt;

&lt;p&gt;The cases the IJ analysis was able to document mostly came to light through one channel: the victim filed a complaint, an investigation opened into the stalking allegation, and the LPR queries the officer had run against the victim's plate became evidence in that investigation. &lt;em&gt;"Only a few of the 14 analyzed cases were initially discovered through internal investigations,"&lt;/em&gt; the report observes. The audit logs that recorded the suspicious queries existed in every case. They were generally not what triggered the investigation. They were what the investigation later relied on.&lt;/p&gt;

&lt;p&gt;This gap between &lt;em&gt;audit-log existence&lt;/em&gt; and &lt;em&gt;audit-log review&lt;/em&gt; has a structural explanation that surfaced in the HN thread on the IJ piece. Several commenters with relevant procurement experience pointed out that Flock-style ALPR systems are typically not licensed by seat, do not require single sign-on, and are routinely accessed via shared departmental accounts. Per-officer query patterns are technically reconstructable from the underlying logs, but operationally they are not aggregated against patterns of misuse. One court-watcher in the same thread, who has volunteered for years observing domestic-violence court proceedings, reported that &lt;em&gt;"Cases where a state surveillance tool or database was used to stalk or harass the victim are completely routine."&lt;/em&gt; The IJ list is what shows up after a victim, an investigation, and a court proceeding have all happened in sequence. The rate at which any prior step fails is not a number the public sees.&lt;/p&gt;
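
&lt;p&gt;The aggregation that is not being run is not complicated. As a minimal sketch, assume a hypothetical export with one &lt;code&gt;(user_id, plate)&lt;/code&gt; row per search; this is not any vendor's real log format, and the thresholds and function names are placeholders of mine. Two passes over such a log already surface the pattern the fourteen cases share: one account returning to the same plate again and again.&lt;/p&gt;

```python
from collections import Counter
from statistics import median

# Hypothetical export: one (user_id, plate) row per search.
# Invented sample data; not a real vendor format.
rows = [
    ("u01", "ABC123"), ("u01", "XYZ789"), ("u01", "JKL456"),
    ("u02", "ABC123"), ("u02", "DEF222"),
    ("u03", "QRS111"), ("u03", "QRS111"), ("u03", "QRS111"),
    ("u03", "QRS111"), ("u03", "QRS111"), ("u03", "TUV333"),
]


def repeat_lookups(rows, min_repeats=4):
    """Return (user, plate, count) for plates one user searched repeatedly."""
    pair_counts = Counter(rows)
    return sorted(
        (user, plate, n)
        for (user, plate), n in pair_counts.items()
        if n >= min_repeats
    )


def volume_outliers(rows, ratio=2.0):
    """Return users whose query volume is at least `ratio` times the median."""
    per_user = Counter(user for user, _ in rows)
    typical = median(per_user.values())
    return sorted(u for u, n in per_user.items() if n >= ratio * typical)


print(repeat_lookups(rows))
print(volume_outliers(rows))
```

&lt;p&gt;Both passes group on the per-user column. A flag from either one is a reason to read the log entries, not a finding of misuse, and neither can be computed at all from a log in which accounts are shared or the user identifier is absent.&lt;/p&gt;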

&lt;h2&gt;
  
  
  The auditing surface is closing
&lt;/h2&gt;

&lt;p&gt;In the same six-month window in which the IJ list was being compiled, two structural moves narrowed the public-disclosure surface that produced it.&lt;/p&gt;

&lt;p&gt;The first was Flock's own December 2025 audit-log change, reported on the HN thread by a public-records requester who had been routinely filing for the audit logs in their town. Until December 2025, the audit logs were listed &lt;em&gt;"by USERID"&lt;/em&gt;, allowing an outside reviewer to correlate query volume against individual officers and identify outlier behaviour. As the requester observed, &lt;em&gt;"This same methodology has been used to catch police stalking in at least one other city."&lt;/em&gt; After the December 2025 update, the same logs were &lt;em&gt;"completely serialized, anonymized"&lt;/em&gt;, removing the per-userid correlation entirely. The change came after 2025 had surfaced several cases of police stalking using Flock data. The reporting cause and the system change occurred in the same calendar year.&lt;/p&gt;

&lt;p&gt;The second was Washington State's &lt;a href="https://app.leg.wa.gov/billsummary?BillNumber=6002&amp;amp;Year=2025" rel="noopener noreferrer"&gt;SB 6002&lt;/a&gt;, the &lt;em&gt;Driver Privacy Act&lt;/em&gt;, &lt;a href="https://stateofsurveillance.org/news/washington-sb6002-driver-privacy-act-alpr-flock-2026/" rel="noopener noreferrer"&gt;signed by Governor Bob Ferguson on March 30, 2026&lt;/a&gt; and effective immediately. The bill's substantive privacy provisions are real: it imposes a 21-day data-retention limit (the original draft had set 72 hours, amended in committee), bans federal immigration access, and prohibits ALPR cameras near food banks, schools, courts, or places of worship. It also, in &lt;a href="https://mrsc.org/stay-informed/mrsc-insight/april-2026/license-plate-reader-data" rel="noopener noreferrer"&gt;Section 5&lt;/a&gt;, exempts ALPR data from disclosure under the state public-records act:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Automated license plate reader data is not subject to disclosure under the public records act, chapter 42.56 RCW, except such data may be used for bona fide research as defined in RCW 42.48.010 and does not include individually identifiable information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The carve-out is narrow on its face. The category of &lt;em&gt;"bona fide research"&lt;/em&gt; it preserves is the category formal academic and policy researchers operate under. The category it excludes is the one the IJ analysis itself depended on: working journalists, civil-liberties organisations, and individual public-records requesters who file because they noticed an audit-log irregularity in their town. The cases on this list surfaced on the back of exactly this kind of grass-roots filing. The mechanism is now shut, in Washington at least, except for the research carve-out, which does not produce the kind of identification the IJ list contains.&lt;/p&gt;

&lt;h2&gt;
  
  
  A second failure mode, same architecture
&lt;/h2&gt;

&lt;p&gt;The fourteen-officer list is not the only public surface where the Flock query log has produced public-record findings this year. The previously-published &lt;a href="//../2026-05-04-flock-demo-partner-program/article.md"&gt;&lt;em&gt;Demo Partner Program&lt;/em&gt;&lt;/a&gt; covered the parallel disclosure that Flock's own employees had been seen accessing private-business camera feeds — a children's gymnastics studio, a community pool — in audit-log entries that named accounts including the company's Director of Growth and VP of Strategic Relations. Two failure modes, same architecture: a system whose query surface is wider than its accountability surface, and which produces evidence of misuse only when an outside party files for the logs and reads them. In both cases, the disclosure was traceable to per-account identification in the audit data. In both cases, the structural answer being deployed is to remove the per-account identification from the audit data.&lt;/p&gt;

&lt;p&gt;The answer is consistent across vendor and state. The visible category of failures has not been the official trigger for either response. The official triggers, where they have been articulated, are &lt;em&gt;privacy&lt;/em&gt; (in the case of the WA law) and &lt;em&gt;security&lt;/em&gt; (in the case of the Flock log update). The public-records-requester category — the only category that produced any of these lists — is not one of the protected categories on either side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coda
&lt;/h2&gt;

&lt;p&gt;Fourteen names is a list, not a thesis. It is also not, by anyone's reckoning, the count. It is the count of cases in which a victim filed a complaint, an investigation opened, the LPR query trail entered evidence, the case became reportable, and a public-interest organisation found and aggregated the reporting. The category of cases that fail any one of those steps is not on the list, by design.&lt;/p&gt;

&lt;p&gt;The next round of names will be harder to find. The audit-log format that surfaced the present round was changed in December 2025; the public-records mechanism that produced the IJ aggregation is, in Washington, closed as of March 2026. The closing is not a coincidence; it is the predictable response of a system in operation to the disclosure of its failure modes. &lt;em&gt;Fourteen&lt;/em&gt; will not be the figure when this is next reported. &lt;em&gt;Fewer than fourteen&lt;/em&gt; will be the figure, because the visibility surface is narrower. The next round of officers will not be a smaller cohort. They will only be a less-counted one.&lt;/p&gt;

</description>
      <category>surveillance</category>
      <category>alpr</category>
      <category>flock</category>
      <category>civilliberties</category>
    </item>
    <item>
      <title>The Slot-Machine Was the Point</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Wed, 17 Jun 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/the-slot-machine-was-the-point-4fm1</link>
      <guid>https://dev.to/arthurpro/the-slot-machine-was-the-point-4fm1</guid>
      <description>&lt;p&gt;&lt;a href="https://larsfaye.com/articles/agentic-coding-is-a-trap" rel="noopener noreferrer"&gt;Lars Faye's &lt;em&gt;Agentic Coding Is a Trap&lt;/em&gt;&lt;/a&gt; — published Sunday, May 3, picked up on Hacker News at 398 points and 316 comments — is the best single compendium of the cognitive-debt evidence base anyone has put together in 2026. It catalogues the studies. It names the trade-offs. It lands on a personal-discipline conclusion. The receipts are now collected; the careful reader will have spent the weekend nodding through them.&lt;/p&gt;

&lt;p&gt;Buried in Faye's second paragraph, almost in passing, is the line that does the actual analytical work. Faye describes the agentic workflow as a process in which &lt;em&gt;"someone defines the project's requirements ... generates a plan, and then &lt;a href="https://blog.quent.in/blog/2026/03/09/one-more-prompt-the-dopamine-trap-of-agentic-coding/" rel="noopener noreferrer"&gt;pulls the slot machine lever&lt;/a&gt; over and over, iterating and reiterating with often multiple agent instances until it's done."&lt;/em&gt; The link goes to a March post by Quentin Rousseau, CTO and co-founder of Rootly, titled &lt;em&gt;One More Prompt: The Dopamine Trap of Agentic Coding.&lt;/em&gt; The metaphor isn't Faye's. Rousseau got there first, in clinical language: the workflow runs on &lt;em&gt;"variable ratio reinforcement — the same psychological mechanism that makes slot machines the most addictive form of gambling"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That is the framing the rest of Faye's piece is downstream of, and it is the framing this article is about.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the receipts add up to
&lt;/h2&gt;

&lt;p&gt;Faye's catalogue, briefly. Anthropic's own &lt;a href="https://www.anthropic.com/research/how-ai-is-transforming-work-at-anthropic" rel="noopener noreferrer"&gt;research note on internal use&lt;/a&gt; names what it calls the &lt;em&gt;"paradox of supervision"&lt;/em&gt;: effective use of Claude requires the very skills that sustained Claude use atrophies. &lt;a href="https://www.media.mit.edu/publications/your-brain-on-chatgpt/" rel="noopener noreferrer"&gt;MIT Media Lab's &lt;em&gt;Your Brain on ChatGPT&lt;/em&gt;&lt;/a&gt; measured the cognitive impact and labelled it &lt;em&gt;cognitive debt&lt;/em&gt;. &lt;a href="https://www.404media.co/microsoft-study-finds-ai-makes-human-cognition-atrophied-and-unprepared-3/" rel="noopener noreferrer"&gt;A Microsoft study covered by 404 Media&lt;/a&gt; reached parallel findings for knowledge workers more broadly. &lt;a href="https://www.anthropic.com/research/AI-assistance-coding-skills" rel="noopener noreferrer"&gt;A separate Anthropic study on coding skills&lt;/a&gt; reported a 47% drop-off in debugging skills among engineers leaning heavily on AI-assisted workflows. Sandor Nyako, the LinkedIn engineering director who oversees fifty engineers, has &lt;a href="https://www.businessinsider.com/leaders-worry-about-skill-atrophy-due-to-ai-adoption-2025-10" rel="noopener noreferrer"&gt;reportedly asked his team&lt;/a&gt; not to use these tools for &lt;em&gt;"tasks that require critical thinking or problem-solving."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These are well-credentialed studies, performed mostly by parties with no incentive to overstate the effect. Each one names some symptom: cognitive debt, debugging atrophy, skill-formation interruption, supervisory paradox. The piece this article is responding to has done the hard work of collecting them.&lt;/p&gt;

&lt;p&gt;What the catalogue underspecifies is the &lt;em&gt;upstream&lt;/em&gt; question. Why does this particular workflow produce these particular symptoms? The answer is in the link Faye's second paragraph throws away.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Rousseau actually said
&lt;/h2&gt;

&lt;p&gt;Rousseau's March post is unusually direct. The author, writing as a working CTO of an early-stage company, names the workflow's reward schedule and its physiological consequences in the same paragraph. The agentic-coding loop, in Rousseau's account, is structured around &lt;em&gt;intermittent reinforcement&lt;/em&gt;. Sometimes the diff is what you wanted, sometimes not, sometimes spectacularly close, sometimes laughably wrong. The &lt;em&gt;"intermittent reinforcement of those dopamine and adrenaline hits creates the core addictive pull,"&lt;/em&gt; in Rousseau's phrasing. The behaviour the schedule produces, in Rousseau's reporting from the Y Combinator founder community he is part of: developers &lt;em&gt;"routinely coding until 2-4 AM despite no deadline pressure"&lt;/em&gt;, the author himself reaching for orexin-receptor-blocker prescriptions to push back against the wakefulness effect, and a public comparison from Garry Tan describing the dopamine return as comparable to manually finding the answer. Rousseau also reports that approximately 25% of the most recent Y Combinator batch has codebases described as &lt;em&gt;"almost entirely AI-generated"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is the framing Faye is referring to, and it is not metaphorical decoration. The engineering-cohort observation is that a particular workflow produces a particular reward schedule, and that reward schedule produces a particular pattern of behaviour, including pharmaceutical countermeasures. The behaviour pattern is not coincidence. It is the engineered output of the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the workflow is shaped for
&lt;/h2&gt;

&lt;p&gt;If the workflow's reward schedule is variable-ratio reinforcement, the question is whose problem that solves. The engineer's problem is that the work needs to get done. The vendor's problem is that the engineer needs to keep paying for tokens. The two problems do not point in the same direction; one of them gets solved more thoroughly than the other.&lt;/p&gt;

&lt;p&gt;Faye's piece links to &lt;a href="https://www.pymnts.com/artificial-intelligence-2/2026/ai-adoption-is-being-measured-in-tokens-but-the-metric-falls-short-experts-say/" rel="noopener noreferrer"&gt;reporting&lt;/a&gt; on a related dynamic: AI adoption inside organisations is being measured in tokens spent, and that measurement is being used as a proxy for productivity. Token count is the easiest number for an engineering-management dashboard to render; it is also the revenue line item for the vendor. The customer's productivity metric and the vendor's revenue are the same number, which is unusual and worth thinking about. The &lt;a href="//../2026-05-04-token-pricing-adoption-curves/article.md"&gt;Uber data published earlier this month&lt;/a&gt;, with per-engineer monthly token bills running to $500–$2,000, the engineering organisation ramping from 32% to 84% adoption in four months, and the entire 2026 AI budget consumed in the first quarter, is the corporate-finance-line-item version of the YC founders Rousseau describes coding to 2 AM. The lever is the same lever; only the cadence and the venue differ. Each engineer pulling it at industrial frequency is one row in a budget the CFO did not anticipate.&lt;/p&gt;

&lt;p&gt;The alignment is not pedagogical. It is industrial. It is the same alignment that produced the previous decade's &lt;em&gt;attention economy&lt;/em&gt;, with the engineer in the seat the social-media user used to occupy.&lt;/p&gt;

&lt;h2&gt;
  
  
  We have done this before
&lt;/h2&gt;

&lt;p&gt;The historical analog is not assembly-to-FORTRAN, the comparison Faye explicitly rejects in his piece, and rejects correctly. As Faye puts it, &lt;em&gt;"a higher level of ambiguity is not a higher level of abstraction,"&lt;/em&gt; and the FORTRAN frame flatters the new tools by aligning them with a pedigree of advances they do not earn. The honest analog is closer to home, in the same fifteen-year window many readers of this piece have lived through.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Social-media attention economy (2010s)&lt;/th&gt;
&lt;th&gt;Agentic-coding token economy (2026)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reward shape&lt;/td&gt;
&lt;td&gt;Variable-ratio reinforcement (next post, next like)&lt;/td&gt;
&lt;td&gt;Variable-ratio reinforcement (next prompt, next diff)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Captive population&lt;/td&gt;
&lt;td&gt;Users who didn't realise they had opted in&lt;/td&gt;
&lt;td&gt;Engineers under top-down workflow mandates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Revenue mechanism&lt;/td&gt;
&lt;td&gt;Attention → ad inventory&lt;/td&gt;
&lt;td&gt;Tokens → metered consumption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Externalised cost&lt;/td&gt;
&lt;td&gt;Mental health, polarisation, attention-deficit&lt;/td&gt;
&lt;td&gt;Cognitive debt, skill atrophy, vendor lock-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Industry rebuttal at scale&lt;/td&gt;
&lt;td&gt;"It's just a phone, put it down" (representative)&lt;/td&gt;
&lt;td&gt;"Demote AI's role" (Faye's prescription)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time from product launch to documented harm&lt;/td&gt;
&lt;td&gt;Roughly a decade (2010 → 2020)&lt;/td&gt;
&lt;td&gt;Roughly three years (2023 → 2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The compression of the recognition window is the part most worth noticing. The attention-economy harms took a decade to accumulate enough peer-reviewed evidence to argue about; the token-economy harms have a &lt;em&gt;paradox-of-supervision&lt;/em&gt; admission from the largest vendor inside three years. The cohort doing the measurement also happens to be the cohort being measured, which speeds the reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the lever pulls cost, one engineer at a time
&lt;/h2&gt;

&lt;p&gt;The HN thread on Faye's piece is unusually heavy on testimony from inside the senior bracket. The senior-engineer-cannot-answer-questions scene that the previously-published companion piece &lt;a href="//../2026-05-04-what-we-lose-coding-becomes-reviewing/article.md"&gt;&lt;em&gt;What We Lose When Coding Becomes Reviewing&lt;/em&gt;&lt;/a&gt; centred is one such datapoint; what concerns this piece is the moment immediately downstream, when the same engineer reaches for the same workflow again the next morning. One commenter with thirty-five years of experience offered a more cheerful counter, that agentic tools had let them learn more in the last few years than in the prior thirty-five, only to draw an immediate reply that this is a curve available only to engineers who already had thirty-five years of friction in the bank to draw on. Both readings can be right. The point one of them was making, deeper in the same comment thread, is the one that keeps catching: &lt;em&gt;"I think a great deal of what made computing an amazing industry to work in is going to or has already died."&lt;/em&gt; Whether the speaker is right depends on how the next five years go. The reading is not a complaint; it is a description offered without satisfaction by someone who watched the previous version.&lt;/p&gt;

&lt;p&gt;What the lever pulls cost the individual engineer, in the cases the studies are now measuring, is the cognitive practice that produced the engineer in the first place. The slot-machine analogy is exact in the wrong way: a casino visitor leaves with thinner pockets and the same brain. The agentic-coding loop costs the brain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coda
&lt;/h2&gt;

&lt;p&gt;The slot-machine framing is not a complaint. It is a description offered, not for the first time, by people who have noticed that the workflow's reward shape and the vendor's revenue shape are the same shape, and that the alignment has consequences. We have done this once before, with a different captive population and a different metering surface, and the consequences took a decade to be argued about with a straight face. The compressed timeline this time is a small mercy. The receipts arrived faster. The remaining question is whether the recognition is going to do any structural work, or whether the field, having decided that &lt;em&gt;demote AI's role&lt;/em&gt; is a sufficient answer at the individual level, will accept that as the answer at the institutional level too. The cost was not a bug. The cost was the design. Every previous case of this pattern was eventually answered by someone with the standing to write a rule about it. The slot-machine industry, eventually, accepted some.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agenticcoding</category>
      <category>devculture</category>
    </item>
    <item>
      <title>The junior-developer pipeline is a slow-motion arithmetic problem</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Tue, 16 Jun 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/the-junior-developer-pipeline-is-a-slow-motion-arithmetic-problem-bnh</link>
      <guid>https://dev.to/arthurpro/the-junior-developer-pipeline-is-a-slow-motion-arithmetic-problem-bnh</guid>
      <description>&lt;p&gt;The two numbers I want to start with are these. &lt;strong&gt;Stack Overflow's monthly question volume fell from 108,563 in November 2022 (the month ChatGPT launched) to 25,566 by December 2024&lt;/strong&gt;, a 76.5% drop, and by May 2025 monthly question volume had reverted to the level of Stack Overflow's first month in 2009. &lt;strong&gt;Brynjolfsson, Chandar and Chen's August 2025 Stanford Digital Economy paper, &lt;em&gt;Canaries in the Coal Mine?&lt;/em&gt;, with data through July 2025, found that software-developer employment for the 22-to-25-year-old cohort had declined nearly 20% from its late-2022 peak.&lt;/strong&gt; The first number describes a knowledge surface used by junior developers as a substitute for the mentors they didn't have. The second number describes the junior developers themselves.&lt;/p&gt;

&lt;p&gt;Each number, in isolation, has a defensible reading that doesn't rise to &lt;em&gt;crisis&lt;/em&gt;. Stack Overflow's traffic decline was already underway from mid-2021; ChatGPT accelerated rather than caused it. The 22-to-25 employment decline is real but is also entangled with a broader entry-level slowdown across the whole tech sector that has multiple causes. I want to take both numbers seriously without slipping into the apocalyptic register the topic invites, because the pipeline math underneath them is interesting on its own terms and it is the part the apocalyptic framings tend to skip past.&lt;/p&gt;

&lt;h2&gt;
  
  
  The arithmetic that doesn't move
&lt;/h2&gt;

&lt;p&gt;Software engineers age along a predictable curve. The career stages are well-documented, and they are stages, not a continuum: zero-to-two years is &lt;em&gt;junior&lt;/em&gt;; three-to-five is &lt;em&gt;mid-level&lt;/em&gt;; six-to-ten is &lt;em&gt;senior&lt;/em&gt;; beyond ten is &lt;em&gt;principal&lt;/em&gt; or &lt;em&gt;architect&lt;/em&gt;. The gates between stages are negotiated, not literal (companies use different titles, the boundaries blur), but the broad shape of the progression is stable across the industry and has been for two decades. The 2030 senior cohort is built primarily from the 2025 junior cohort, with a long tail of bootcamp graduates, lateral hires from adjacent fields, and returners that does not change the overall arithmetic. The 2035 principal cohort is built primarily from the 2025 mid-level cohort by the same mechanism.&lt;/p&gt;

&lt;p&gt;That sentence is the load-bearing thing. If the population of juniors hired in any given year shrinks materially, the population of seniors available five-to-ten years later shrinks proportionally. The shortcuts that exist — bootcamp accelerators, intensive apprenticeships, rapid promotions — produce structurally different judgment, and at population scale the substitution capacity is small relative to the cohort gap. The senior is somebody who has spent five-to-ten years making mistakes, getting reviewed, fixing things, debugging at 2am, and gradually accumulating the judgment that distinguishes them from a junior. The &lt;em&gt;time&lt;/em&gt; component is largely non-substitutable. There is a body of cognitive-science literature on expertise development — Anders Ericsson's deliberate-practice work is the canonical reference, with subsequent work qualifying the strength of the effect but not the underlying mechanism — that puts numbers on this, but you don't need the literature to recognise the pattern. You just need to look at the org chart of any company that has been operating for thirty years and see who got hired when.&lt;/p&gt;

&lt;p&gt;The arithmetic, then, is the arithmetic. SignalFire's &lt;a href="https://www.signalfire.com/blog/signalfire-state-of-talent-report-2025" rel="noopener noreferrer"&gt;&lt;em&gt;State of Tech Talent Report 2025&lt;/em&gt;&lt;/a&gt;, drawn from LinkedIn data on 600M+ professionals, reports that entry-level hiring at the 15 largest US tech firms fell 25% from 2023 to 2024 and that the share of new graduates in Big Tech &lt;em&gt;hires&lt;/em&gt; dropped from 32% in 2019 to 7% in 2024. Entry-level tech postings dropped 60% from 2022 to 2024, by &lt;a href="https://restofworld.org/2025/engineering-graduates-ai-job-losses/" rel="noopener noreferrer"&gt;other widely-cited tracking&lt;/a&gt;. Google and Meta have been hiring approximately half as many new graduates as they were in 2021. A &lt;a href="https://leaddev.com/the-engineering-leadership-report-2025" rel="noopener noreferrer"&gt;LeadDev 2025 &lt;em&gt;Engineering Leadership Report&lt;/em&gt;&lt;/a&gt; found that 54% of respondents expect long-term junior hiring to drop, and 38% agreed that AI tools have already reduced the direct mentoring junior engineers receive from seniors. None of these numbers are adjustable after the fact. They are the inputs to the senior-engineer-population calculation for the early 2030s.&lt;/p&gt;
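
&lt;p&gt;The proportionality claim can be checked with a toy cohort model. Everything in the sketch below is an assumption for illustration: the flat 1,000 hires a year, the 90% annual retention, the three-year 25% cut, and the function name. Only the six-to-ten-year senior window comes from the stage definitions above.&lt;/p&gt;

```python
# Toy cohort model. Every number here is illustrative, not measured.
SENIOR_MIN_YEARS = 6     # senior window taken from the stage definitions
SENIOR_MAX_YEARS = 10
ANNUAL_RETENTION = 0.9   # assumed share of a cohort still in the field each year


def senior_pool(hires_by_year, year):
    """Count engineers with 6 to 10 years of experience in `year`."""
    pool = 0.0
    for tenure in range(SENIOR_MIN_YEARS, SENIOR_MAX_YEARS + 1):
        hired = hires_by_year.get(year - tenure, 0)
        pool += hired * ANNUAL_RETENTION ** tenure
    return pool


# Baseline: flat junior hiring. Reduced: a 25% cut for three years.
baseline = {y: 1000 for y in range(2015, 2031)}
reduced = dict(baseline)
for y in (2024, 2025, 2026):
    reduced[y] = 750

for year in (2030, 2032, 2034):
    ratio = senior_pool(reduced, year) / senior_pool(baseline, year)
    print(year, f"senior pool vs baseline: {ratio:.2f}")
```

&lt;p&gt;In this model the senior pool starts shrinking six years after the first thinned cohort was hired, well outside any quarterly planning horizon, and no hiring decision made in that later year changes the inputs.&lt;/p&gt;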

&lt;h2&gt;
  
  
  What AI is actually doing to junior work
&lt;/h2&gt;

&lt;p&gt;The piece of the story that's specific to AI rather than to the broader entry-level slowdown is what's happening to the &lt;em&gt;kind of work&lt;/em&gt; a junior would have done. The 2022-junior's first-year output (boilerplate, unit tests, small features, refactoring of clearly-bounded modules, the first draft of a function that a mid-level engineer would then revise) is what a senior engineer with Claude Code or Cursor or Copilot now produces in minutes. The &lt;em&gt;output&lt;/em&gt; is closer to what the senior would have asked for. The &lt;em&gt;cost&lt;/em&gt; of producing it is a small fraction of what the junior's salary represented. The fundamental engineering economics of training a junior have shifted because the training tasks themselves are no longer differentially profitable to give to a junior.&lt;/p&gt;

&lt;p&gt;The second-order finding from a year of operating data is that junior engineers &lt;em&gt;with&lt;/em&gt; AI tools are not in fact a competitive substitute for the seniors-with-AI workflow. Mark Russinovich and Scott Hanselman's February 2026 &lt;em&gt;Communications of the ACM&lt;/em&gt; piece coined the term &lt;em&gt;AI drag&lt;/em&gt; for the phenomenon: early-in-career developers using AI tools have a productivity disadvantage that mid-career developers don't have, because they lack the judgment to steer, verify, and integrate AI output. The 2025 LeadDev survey describes the same mechanism in different words — the 38% of leaders who say AI has reduced the mentoring juniors receive are observing the consequence of the same underlying gap. A junior with Claude Code produces output as fast as a senior with Claude Code, on the surface, but the output requires more rework downstream. The senior's marginal hour with AI is amplified. The junior's marginal hour with AI is the same hour with a lower bug-detection rate and a higher cleanup cost.&lt;/p&gt;

&lt;p&gt;This finding is the part that closes the trap. &lt;em&gt;If juniors with AI weren't differentially less effective than seniors with AI&lt;/em&gt;, the hiring decision would be a straightforward training-investment question — companies would still hire juniors because they're cheaper to train into seniors than seniors are to recruit laterally. Because juniors with AI are differentially less effective in the short term, the immediate-quarter math favours the senior-only team, and the immediate-quarter math is what budget cycles run on. The pipeline-math half of the equation operates on five-to-ten-year horizons that no quarterly review surfaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where seniors come from when juniors aren't being trained
&lt;/h2&gt;

&lt;p&gt;If you walk the arithmetic forward by half a decade, you reach a position the industry has not yet articulated honestly. &lt;em&gt;2030 senior-engineer supply&lt;/em&gt; is bounded above by &lt;em&gt;2025 junior-engineer hiring&lt;/em&gt;. The companies whose 2025 hiring decisions trimmed juniors will have proportionally fewer mid-level engineers in 2028 and proportionally fewer senior engineers in 2031. The &lt;em&gt;natural&lt;/em&gt; response of a company finding itself short of senior engineers in 2031 is to recruit them laterally — but the population available to recruit from is the same population every other company in the same situation will be trying to recruit from, and the labour-market clearing price for a 2031 senior engineer will reflect that scarcity.&lt;/p&gt;

&lt;p&gt;Companies that have hired juniors continuously through 2024–2026 will, in 2031, find themselves with a senior-engineer cohort their competitors cannot easily match. Companies that paused junior hiring during this window will face one of three options in 2031: pay the elevated lateral-hire premium for senior engineers; train mid-career hires (developers from outside the field, bootcamp graduates, returners) into seniors on a compressed timeline that almost certainly produces lower-quality outcomes; or scope down the work to what their existing senior population can do. None of these options is bad on its own. The &lt;em&gt;combination&lt;/em&gt; of all three across an industry that historically grew its senior pool through internal training has the predictable shape that economics has documented in any market where a single producer cohort tries to skip a generation of replacement workers — the producer-cohort price rises, the supply tightens, and the most-dependent buyers find themselves under contractual leverage they have not previously had to negotiate.&lt;/p&gt;

&lt;p&gt;The most uncomfortable version of this is the &lt;a href="https://evalcode.com/posts/if-you-stop-hiring-juniors-your-seniors-own-you/" rel="noopener noreferrer"&gt;eval(code)&lt;/a&gt; framing: &lt;em&gt;if you stop hiring juniors, your seniors own you.&lt;/em&gt; The framing is more glib than the underlying point deserves, but the underlying point survives. A senior-only engineering organisation is a labour-market position with a known direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What kinds of work still build judgment
&lt;/h2&gt;

&lt;p&gt;The technical question that follows is what an &lt;em&gt;AI-era&lt;/em&gt; training path for junior engineers actually looks like. The conventional path — boilerplate-and-bugs producing pattern-recognition and judgment over five years — is the path AI tools have most thoroughly automated. The proposals on the table for what replaces it cluster around three patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read-and-explain work.&lt;/strong&gt; A junior who can take an unfamiliar codebase and produce a coherent explanation of what it does, where the failure modes are, and where the architectural decisions don't fit the current requirements, is doing the kind of work that builds the judgment a senior engineer needs. AI tools can produce a first-pass explanation faster than the junior can; they cannot produce the &lt;em&gt;judgment about which parts of the explanation to trust&lt;/em&gt; that the junior is being trained to develop. The exercise of producing the explanation, comparing it to AI-generated explanations, finding the discrepancies, and explaining the discrepancies is one shape of training that survives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification-and-audit work.&lt;/strong&gt; The output of an LLM is most usefully treated as a draft that requires verification. Juniors who specialise in &lt;em&gt;verifying&lt;/em&gt; AI output — running the test cases, checking the citations, finding the cases the LLM didn't cover — are doing work that is structurally similar to code review and produces similar judgment. The &lt;em&gt;preceptorship&lt;/em&gt; model that Russinovich and Hanselman propose in their &lt;em&gt;Communications of the ACM&lt;/em&gt; piece is one shape this can take: a junior paired closely with a senior, with the junior's day-to-day work organised around auditing, prompting, and verifying AI output as a core competency from the first week rather than as a senior-only meta-skill picked up later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-system work.&lt;/strong&gt; The category of work that AI tools are worst at is the work that requires understanding &lt;em&gt;which&lt;/em&gt; abstractions the team has chosen and &lt;em&gt;why&lt;/em&gt;. Codebases ten years old are full of decisions that look strange in isolation and make sense only in the context of the operational history that produced them. A junior tasked with maintaining a long-running system, fixing the incidents, learning why the previous abstractions are there, builds judgment that doesn't emerge from greenfield AI-assisted code generation. The work survives because the codebase predates the AI tools and the AI tools cannot reconstruct the operational history.&lt;/p&gt;

&lt;p&gt;What these patterns share is that the &lt;em&gt;training&lt;/em&gt; component is structurally separated from the &lt;em&gt;production&lt;/em&gt; component. The production work the junior does is no longer differentially valuable on the immediate-quarter timeline; the training work is differentially valuable on the five-to-ten-year timeline. Companies that take the pipeline math seriously are the ones that will fund the training work as a first-class deliverable rather than as a byproduct of production work that AI tools have made redundant.&lt;/p&gt;

&lt;h2&gt;
  
  
  The summary that matches the data
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Years of experience&lt;/th&gt;
&lt;th&gt;2025 cohort observed&lt;/th&gt;
&lt;th&gt;Implication for 2030&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Junior&lt;/td&gt;
&lt;td&gt;0–2&lt;/td&gt;
&lt;td&gt;SignalFire: hiring at top 15 US firms down 25% YoY 2023→2024; new-grad share of Big Tech hires 32% (2019) → 7% (2024); entry-level postings down 60% 2022→2024; 22–25-year-old developer employment –20% from 2022 peak (Stanford, July 2025 data)&lt;/td&gt;
&lt;td&gt;Smaller pool of mid-level engineers in 2028&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid-level&lt;/td&gt;
&lt;td&gt;3–5&lt;/td&gt;
&lt;td&gt;Filled by 2020–2022 junior hires (the last full cohort)&lt;/td&gt;
&lt;td&gt;Smaller pool of senior engineers in 2031&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Senior&lt;/td&gt;
&lt;td&gt;6–10&lt;/td&gt;
&lt;td&gt;Filled by 2015–2019 junior hires (full cohorts; the largest available pool the industry has ever had)&lt;/td&gt;
&lt;td&gt;Lateral-hire premium rises sharply 2030–2033; senior-only orgs face contractual leverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Principal&lt;/td&gt;
&lt;td&gt;10+&lt;/td&gt;
&lt;td&gt;Filled by 2015–2019 mid-level hires, several promoting up; the top of the pyramid is currently flush&lt;/td&gt;
&lt;td&gt;The supply of principal-level engineers in 2035 depends on the 2025 mid-level cohort, which depends on the 2020–2022 junior cohort; this is the last point at which the math is fully baked&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The table is a description, not a forecast. Each row's &lt;em&gt;2025 cohort observed&lt;/em&gt; column is reported data from the cited sources. Each row's &lt;em&gt;Implication for 2030&lt;/em&gt; column is the mechanical consequence of the time arithmetic. The forecast component lives in the gap between the two — the assumption that hiring patterns from 2024–2026 will continue, that AI tools will not change in ways that re-open the junior-training path, that companies will not collectively course-correct in time. These are real assumptions, and reasonable people will disagree about each one. The arithmetic does not require the assumptions to be correct in detail; it requires only that the gap between the cohort sizes does not get retroactively filled.&lt;/p&gt;
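&lt;p&gt;The time arithmetic behind the right-hand column can be sketched in a few lines of Python. This is a minimal illustration, not a model: it assumes zero attrition and promotion purely by tenure, it takes only the lower bounds of the Junior, Mid-level, and Senior bands from the table (the open-ended Principal band is left out), and the names &lt;code&gt;STAGE_MIN_YEARS&lt;/code&gt;, &lt;code&gt;stage_entry_years&lt;/code&gt;, and &lt;code&gt;stage_in_year&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
# Lower bound (years of experience) of each band, read off the table above.
# Assumption: promotion happens purely by tenure and nobody leaves the field.
STAGE_MIN_YEARS = {"Junior": 0, "Mid-level": 3, "Senior": 6}


def stage_entry_years(hire_year):
    """First calendar year in which a cohort hired in hire_year enters each stage."""
    return {stage: hire_year + years for stage, years in STAGE_MIN_YEARS.items()}


def stage_in_year(hire_year, year):
    """Stage occupied in `year` by a cohort hired in hire_year (None before hiring).

    Tops out at Senior because the open-ended Principal band is not modelled.
    """
    tenure = year - hire_year
    # Bands are listed in ascending order, so the last one reached is the current one.
    reached = [stage for stage, years in STAGE_MIN_YEARS.items() if tenure >= years]
    return reached[-1] if reached else None


if __name__ == "__main__":
    print(stage_entry_years(2025))    # {'Junior': 2025, 'Mid-level': 2028, 'Senior': 2031}
    print(stage_in_year(2022, 2025))  # Mid-level
```

&lt;p&gt;Under those assumptions, a cohort hired in 2025 reaches mid-level in 2028 and senior in 2031, which is the spacing the table relies on. A smaller 2025 cohort therefore shows up as a smaller senior pool six years later unless the gap is filled from elsewhere.&lt;/p&gt;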

&lt;h2&gt;
  
  
  What this is not, and what it is
&lt;/h2&gt;

&lt;p&gt;The honest answer to &lt;em&gt;should I learn to code in 2026&lt;/em&gt; splits along an axis the discourse has tended to flatten. Coding as a &lt;em&gt;career&lt;/em&gt; — the path from bootcamp through three years of junior work to a mid-level role at a name-brand firm — is structurally narrower than it was. Coding as a &lt;em&gt;general-purpose intellectual skill&lt;/em&gt; — reading what an AI assistant produces, verifying its output, automating the small things that bother you — is more useful than ever, partly because the AI tools are most useful to people who can read what they produce.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Should you spend three years on a CS degree to enter the entry-level dev market in 2027?&lt;/em&gt; The market looks very different from your older sibling's. &lt;em&gt;Should you learn to code well enough to read what an AI assistant produces?&lt;/em&gt; That one has a much clearer affirmative answer, and the people who can do it are well-positioned for the kind of work the pipeline math is &lt;em&gt;not&lt;/em&gt; eliminating.&lt;/p&gt;

&lt;p&gt;The pipeline crisis is real on the timeline the data describes. The career advice is more local; the local answer depends on which side of the pipeline the reader is positioned on. Both can be true.&lt;/p&gt;

</description>
      <category>juniordevelopers</category>
      <category>career</category>
      <category>aiimpact</category>
      <category>labour</category>
    </item>
    <item>
      <title>The Magic Behind the Screen</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Tue, 16 Jun 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/the-magic-behind-the-screen-gkk</link>
      <guid>https://dev.to/arthurpro/the-magic-behind-the-screen-gkk</guid>
      <description>&lt;p&gt;Mercedes-Benz, the &lt;a href="https://www.drive.com.au/news/mercedes-benz-commits-to-bringing-back-phycial-buttons/" rel="noopener noreferrer"&gt;drive.com.au reporting&lt;/a&gt; by Matt Adams informed us on May 3, has committed to bringing back physical buttons in its upcoming GLC and C-Class models. The company joins Audi, Volkswagen, Mazda, and a steadily lengthening list of carmakers admitting that the era of touch-everything dashboards was, in retrospect, a mistake. The story arrived at the front page of &lt;a href="https://news.ycombinator.com/item?id=47997418" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt; shortly after publication, where it accumulated 797 points and 452 comments, the bulk of them written by people who would like to say they told you so and have, with some patience, been telling you so for ten years.&lt;/p&gt;

&lt;p&gt;The Mercedes announcement is structured as a customer-led correction. &lt;em&gt;"Customers told us two years ago,"&lt;/em&gt; the company's sales chief &lt;strong&gt;Mathias Geisen&lt;/strong&gt; &lt;a href="https://www.autocar.co.uk/car-news/technology/mercedes-reintroduce-buttons-%E2%80%93-stick-big-screens" rel="noopener noreferrer"&gt;told Autocar's James Attwood&lt;/a&gt; on April 27, &lt;em&gt;"'guys, nice idea, but it just doesn't work for us', so we changed that and made it more analogue."&lt;/em&gt; It is a reasonable thing to say. It is also a much smaller thing than the situation it describes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually told
&lt;/h2&gt;

&lt;p&gt;There is, in the public reporting around this announcement, a small piece of context that Mercedes' framing folds away. Customers were not the only party telling the company that touch-everything dashboards did not work. The European New Car Assessment Programme — Euro NCAP, the body whose five-star safety ratings drive a non-trivial fraction of the European new-car market — &lt;a href="https://www.euroncap.com/press-media/euro-ncap-announces-2026-protocol-changes-to-tackle-modern-driving-risks/" rel="noopener noreferrer"&gt;announced in November 2025&lt;/a&gt; that its 2026 testing protocol would assess Human-Machine Interface design including &lt;em&gt;"the availability of physical buttons for commonly used functions, which consumer feedback suggests can reduce distraction."&lt;/em&gt; Vehicles scoring highest, the protocol indicates, will be ones that demonstrate accessible physical controls. The reigning informal industry rule of thumb — that you cannot sell a car in Europe without a five-star Euro NCAP rating, or at least cannot sell it at the price point Mercedes wants to sell at — gives the announcement direct commercial weight.&lt;/p&gt;

&lt;p&gt;Several HN commenters, working from a shared awareness of how this kind of announcement actually gets made, pointed out the parallel pressure from China, where new vehicle regulations are reportedly requiring physical controls for some functions over the same window. Some did not bother being polite about the framing: &lt;em&gt;"Is it Mercedes-Benz deciding to bring back buttons,"&lt;/em&gt; one commenter put it, &lt;em&gt;"or is it that the EU's NCAP safety rating mandated that they bring back buttons, and they are spinning it as a voluntary decision?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpickles.news%2Fposts%2Fmagic-behind-the-screen%2F7959f9e6-3ee0-5431-960e-8fec85c50000.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpickles.news%2Fposts%2Fmagic-behind-the-screen%2F7959f9e6-3ee0-5431-960e-8fec85c50000.jpg" title="Mercedes-Benz interior" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The customer-feedback story is the one Mercedes wants on the press release. It is also the one Mercedes can offer without having to publicly concede that the previous design was actively dangerous, which is the part of the story Euro NCAP's 2026 protocol exists to encode. &lt;em&gt;"It just doesn't work for us"&lt;/em&gt; is the version of the user complaint that fits inside a company-led narrative arc. The Euro NCAP version — that the highest-rated vehicles must now demonstrate physical buttons for commonly used functions, because consumer feedback indicates this reduces distraction — is the regulatory version. They are describing the same physical fact.&lt;/p&gt;

&lt;h2&gt;
  
  
  The sentence that explains everything
&lt;/h2&gt;

&lt;p&gt;The most quoted sentence from Geisen's Autocar interview — and the one that received, on HN, more sustained ridicule than the rest of the announcement combined — is the line in which he attempted to articulate Mercedes' continued faith in screens despite the partial reversal: &lt;em&gt;"I'm a big believer in screens, because I really believe if you want to connect, you have to make the magic work behind the screen."&lt;/em&gt; It is worth pausing on this sentence, because it is the sentence that explains why the previous decade of automotive interior design happened.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpickles.news%2Fposts%2Fmagic-behind-the-screen%2Fc0e9897c-e849-52c9-a94b-6567eda50000.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpickles.news%2Fposts%2Fmagic-behind-the-screen%2Fc0e9897c-e849-52c9-a94b-6567eda50000.jpg" title="Mercedes-Benz touchscreen interface" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sentence does not parse well in any literal sense. &lt;em&gt;"Magic work behind the screen"&lt;/em&gt; is an attempt to gesture at the domain in which a sales executive's instincts most natively operate, which is the domain of &lt;em&gt;connecting with customers&lt;/em&gt; in a sales sense, where a phone-like interface is read as inherently aspirational and an analog one is read as inherently retrograde. One HN commenter, with the relief of someone who has been waiting for the right occasion to use a particular framing, observed that the sentence's parsing failure was the entire point: &lt;em&gt;"I am a big believer in keeping “product people” away from UI design for dangerous machinery."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The framing is harsh, but the diagnosis is exactly correct. The sentence Geisen produced is not the sentence of someone designing a vehicle to be operated safely at speed. It is the sentence of someone designing a piece of hardware to feel, as a shopping experience, like a smartphone. The two design briefs produce different artifacts. The smartphone-first brief produces a 39.1-inch &lt;em&gt;Hyperscreen&lt;/em&gt; covering the entire width of the dashboard. The safety-first brief produces a knob you can find with your hand without taking your eyes off the road. For a long stretch of the 2010s and 2020s, the auto industry chose the first brief. It is now being told by regulators, by customer surveys, by accident-and-injury data, and — only in the last twelve months — by its own sales numbers, that the second brief was the one it was supposed to be working from all along.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the previous answer looked like
&lt;/h2&gt;

&lt;p&gt;Multiple HN commenters, each independently, raised a fact about the dashboard-design problem that predates the entire industry detour: ISO 2575, the international standard governing the symbology of automobile dashboard indicator lamps, has been on the books since 1982. It is a forty-four-year-old document. Its function is to ensure that any driver, climbing into any car, can identify a critical condition without reading any text or making any cognitive effort beyond glancing at a known position on the dashboard.&lt;/p&gt;

&lt;p&gt;The HCI literature on attention-management for high-stakes interfaces — pilots, surgeons, machine operators — has spent the same forty-four years discovering, in case after case, that the principles ISO 2575 encoded in 1982 are roughly correct. Tactile feedback is the form of feedback that a user can process while their visual attention is committed to something else. Muscle memory is the form of memory that survives the cognitive load of an actual emergency. Fixed positions are the form of layout that can be operated peripherally. None of these are deep findings. None of them have been overturned by anything subsequent. The auto industry, beginning around 2013, decided to operate as if they had been overturned by the iPhone.&lt;/p&gt;

&lt;p&gt;What the auto industry actually did, when it removed the buttons, is a thing one HN commenter named directly: &lt;em&gt;"screens over buttons is a&lt;/em&gt; cost cutting &lt;em&gt;measure, not a first-principles design decision."&lt;/em&gt; The case is straightforward. A touchscreen is a single hardware part you can manufacture in volume, source from a small number of suppliers, decouple from the physical assembly of the dashboard, and update in software after the vehicle has shipped. A panel of physical controls is dozens of parts, each requiring its own tooling, suppliers, electrical harnessing, and fit-and-finish testing. Decoupling the UI from the hardware reduces production-pipeline complexity. It also means a UI team can ship updates years after the car has left the factory, which lets the marketing department promise &lt;em&gt;"new features over the air"&lt;/em&gt; in a way that hardware-bound buttons cannot. The case for the touchscreen, on the supplier-side accounting, is real and quantifiable. The case for it on the driver-side accounting is the one that turned out not to hold up.&lt;/p&gt;

&lt;p&gt;Both cases were running simultaneously. One was visible in spreadsheets. The other became visible only after the vehicles had been on the road long enough for the accident-and-injury data to accumulate, for the safety-rating bodies to absorb the pattern, and for the customer-research clinics to surface the &lt;em&gt;"it just doesn't work for us"&lt;/em&gt; reports that Mercedes is now citing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The settings-vs-controls distinction
&lt;/h2&gt;

&lt;p&gt;A more constructive contribution to the HN thread came from a commenter who articulated, in a single move, the principled answer the industry should have arrived at without help: &lt;em&gt;"Settings are great on a touchscreen. A wide variety of options, easily navigated to and explained. They suck on physical buttons, it ends up being like setting the time on a VCR. Controls on the other hand deserve physical buttons. Or levers. or dials/knobs/spinners. It should depend on muscle memory, and the type of control."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the right altitude at which to think about the design problem. The mistake the industry made, in the maximalist period, was conflating &lt;em&gt;settings&lt;/em&gt; — preferences set once and rarely revisited, where a search-and-menu interface is genuinely superior — with &lt;em&gt;controls&lt;/em&gt;, the physical actions a driver performs while operating the vehicle, where any visual interface is at best a degradation and at worst a hazard. The categories sort cleanly once you separate them:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Setting or control?&lt;/th&gt;
&lt;th&gt;Touchscreen ok?&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Enter destination address into navigation&lt;/td&gt;
&lt;td&gt;Setting&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Done while parked; the search-and-list affordance is genuinely superior to a 10-digit keypad&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customise dashboard wallpaper / colour theme&lt;/td&gt;
&lt;td&gt;Setting&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Done once; revisiting is rare; cognitive cost of a menu is acceptable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adjust fan speed when windshield is fogging&lt;/td&gt;
&lt;td&gt;Control&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Done in motion; eyes must remain on road; muscle-memory + tactile feedback dominate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adjust audio volume&lt;/td&gt;
&lt;td&gt;Control&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Done in motion; the 1990s rotary knob was the right answer and remains the right answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Toggle defroster, hazards, traction control&lt;/td&gt;
&lt;td&gt;Control&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Time-critical; ISO 2575-class operation; must be findable without looking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tune navigation map detail level / lane-guidance preferences&lt;/td&gt;
&lt;td&gt;Setting&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Done occasionally; menu-search affordance fits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skip music track&lt;/td&gt;
&lt;td&gt;Control&lt;/td&gt;
&lt;td&gt;Steering-wheel button&lt;/td&gt;
&lt;td&gt;Routine in-motion gesture; muscle memory is the entire interaction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The maximalist touchscreen treated all interactions as if they were settings. The auto industry's design vocabulary, for a decade, treated the categories as interchangeable. They are not.&lt;/p&gt;

&lt;p&gt;What Mercedes is now doing — keeping the giant Hyperscreen, but adding back physical buttons in front of the dual wireless chargers and on the steering wheel — is, awkwardly, the architecture the settings-vs-controls distinction predicts. Settings stay on the screen. Controls — climate, volume, frequently-used cabin functions — return to surfaces a hand can find without the eye following. The implementation is partial; commenters with first-hand experience of the new VW ID-series and the post-facelift Mercedes A-Class noted that some of the newest models have replaced even the wheel-mounted physical buttons with capacitive-touch ones, which exhibits the same failure pattern at smaller scale. But the direction of travel is correct, finally, after a decade in which it was not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost the industry didn't see in the spreadsheet
&lt;/h2&gt;

&lt;p&gt;The thing the touchscreen detour cost, that the industry now has to figure out how to quietly amortize, is not primarily money. It is a decade of vehicles already on the road, owned by people who paid premium prices for them, that are &lt;em&gt;worse to operate&lt;/em&gt; — on the testimony of the owners and reviewers who use them daily — than the cars those same people traded in. The industry-wide regression from analog to touch happened over a period long enough to ensure that an enormous installed base of touch-only vehicles will be on roads, and in resale markets, across the next vehicle-replacement cycle. The owners of those vehicles will not be retroactively given knobs. They will, instead, be given the experience of watching the next generation of cars advertise as a feature the absence of the design choice that defined their own.&lt;/p&gt;

&lt;p&gt;There is no specific accounting line item for this kind of cost. The industry that produced it does not, on the available record, intend to apologize for it. The Mercedes announcement is structured to claim the reversal as evidence of the company's responsiveness to its customers, not as evidence that the previous decade's design language was a structural error. &lt;em&gt;"We listened,"&lt;/em&gt; is the language Mercedes wants in the headlines. &lt;em&gt;"We were wrong, in a way that produced measurably worse outcomes for the people who paid us, for ten years"&lt;/em&gt; is not. The asymmetry is normal corporate speech. It is also the reason this kind of error tends to recur.&lt;/p&gt;

&lt;p&gt;The auto industry will reverse this one over the next five years; the ID.Polo and the new C-Class will arrive with their physical buttons, the Euro NCAP ratings will adjust, the Chinese regulation will take effect, and the press cycle will declare the era of touch-everything dashboards officially over. What is harder to predict is what the &lt;em&gt;next&lt;/em&gt; version of the same mistake looks like. The instinct that produced the maximalist touchscreen — the instinct that said &lt;em&gt;make the car feel like a phone, because phones are the consumer-product surface customers are trained to want&lt;/em&gt; — has not been retired. It has only, momentarily, been overruled. The next opportunity it gets, in some adjacent product category whose safety profile is less easily measured by accident data and whose regulatory body is less vigilant than Euro NCAP, it will produce the same artifact again.&lt;/p&gt;

&lt;h2&gt;
  
  
  What stays
&lt;/h2&gt;

&lt;p&gt;What stays from the Mercedes story, after the C-Class launch and the Euro NCAP rating cycle and the inevitable run of &lt;em&gt;physical-buttons-are-back&lt;/em&gt; trend pieces, is the sentence Geisen produced when asked to explain the screens-and-buttons hybrid future. &lt;em&gt;"I'm a big believer in screens, because I really believe if you want to connect, you have to make the magic work behind the screen."&lt;/em&gt; It is not a sentence about safety, attention, or the actual operation of a vehicle. It is a sentence about how a sales executive, who probably does not drive his own product in heavy weather at speed, models the customer's relationship to the dashboard. The sentence's parsing failure is the diagnostic; the decade of automotive interior design produced under its instinct is the symptom.&lt;/p&gt;

&lt;p&gt;ISO 2575 has been on the books since 1982 and will remain so through whichever fashion cycle replaces this one. The mistake the industry made was assuming that the standard had been made obsolete by a new substrate, rather than recognizing that the standard was about the underlying physics of human attention, which the new substrate did not change. The buttons are coming back because they were never the part that needed to leave.&lt;/p&gt;

&lt;p&gt;The magic, it turns out, doesn't actually have to work behind the screen. It mostly has to work under the driver's right hand, where it always did.&lt;/p&gt;

</description>
      <category>hci</category>
      <category>ux</category>
      <category>automotive</category>
      <category>physicalbuttons</category>
    </item>
    <item>
      <title>Cursor's compression isn't a bug. It's how it works.</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Mon, 15 Jun 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/cursors-compression-isnt-a-bug-its-how-it-works-2680</link>
      <guid>https://dev.to/arthurpro/cursors-compression-isnt-a-bug-its-how-it-works-2680</guid>
      <description>&lt;p&gt;The most useful sentence in &lt;a href="https://cursor.com/blog/dynamic-context-discovery" rel="noopener noreferrer"&gt;Cursor's "Dynamic Context Discovery" blog post&lt;/a&gt; (Jan 6, 2026) is the one written in the kind of plain language engineering teams use when they've decided to admit a trade-off they haven't fully solved:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When the model's context window fills up, Cursor triggers a summarization step to give the agent a fresh context window with a summary of its work so far. &lt;strong&gt;But the agent's knowledge can degrade after summarization since it's a lossy compression of the context.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I keep coming back to that line because of how much it says about the shape of recent agent failures. In late April, a Cursor session running Claude Opus 4.6 issued a single &lt;code&gt;volumeDelete&lt;/code&gt; mutation against PocketOS's production volume on Railway, took the volume's backups with it (Railway stores them in the same blast radius), and produced a "confession" afterwards enumerating which rules it had violated to do it. The agent could &lt;em&gt;cite the rules&lt;/em&gt; in the confession. It just could not, in the moment, connect them to what its hands were doing. The PocketOS founder thread by Jer Crane (@lifeof_jer) laid out the timeline and the exact API call in detail, and several outlets (The Register, Tom's Hardware, Decrypt) reproduced it.&lt;/p&gt;

&lt;p&gt;That part of the post-mortem is what I want to walk through here. It is not really about the model. It is about the harness (the layer between the chat window and the model's context), and specifically what compaction does to the chain of reasoning that's supposed to keep an agent inside its rails.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "compaction" is, in the version Cursor ships
&lt;/h2&gt;

&lt;p&gt;Cursor's harness uses &lt;strong&gt;prompt-based summarization&lt;/strong&gt; for compaction. When the live context approaches the model's window limit, the harness asks the model to summarise its session so far. That summary becomes the seed for a fresh window, and the agent continues from there. (Cursor's other post, &lt;a href="https://cursor.com/blog/self-summarization" rel="noopener noreferrer"&gt;&lt;em&gt;Training Composer for longer horizons&lt;/em&gt;&lt;/a&gt;, Mar 17, 2026, describes how their in-house Composer model is RL-trained with compaction as part of the training loop, but Composer is Composer. Claude Opus running through Cursor gets the generic prompt-based version.)&lt;/p&gt;

&lt;p&gt;The Cursor Forum has known about the timing being off for months. A user posted in &lt;a href="https://forum.cursor.com/t/compaction-not-happening-soon-enough/149490/3" rel="noopener noreferrer"&gt;thread 149490&lt;/a&gt; that on Opus 4.5, "in prior builds summarization would happen at 70-80%. But this time I ran up into the 90% mid action, and it's showing 100% full!" A Cursor staff member replied: "This is a known issue with auto-summarization. It can trigger late or incorrectly. The team is aware of it. Workaround: try running &lt;code&gt;/summarize&lt;/code&gt; manually when you see the context getting close to 70 to 80%."&lt;/p&gt;

&lt;p&gt;Read that twice. The vendor is asking the user to drive a heuristic that the harness was supposed to drive autonomously, because the heuristic doesn't fire reliably. That alone is not the story. The story is that &lt;strong&gt;even when compaction fires correctly, the resulting context is structurally different from the one the model was reasoning in two seconds earlier&lt;/strong&gt;, and the chat window does not tell you that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the structural difference matters
&lt;/h2&gt;

&lt;p&gt;Two threads of research converge here, and they predict exactly the failure mode operators see in the wild.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thread 1: position effects in long contexts.&lt;/strong&gt; Liu et al.'s &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;&lt;em&gt;Lost in the Middle&lt;/em&gt;&lt;/a&gt; (2023) showed the U-shaped curve that everyone now cites: performance is best when relevant information sits at the start or end of the window, and degrades sharply in the middle. The system prompt sits at the start. The current task and tool output sit at the end. Any safety rule whose binding force depends on a chain (&lt;em&gt;rule R says don't do X; this action &lt;strong&gt;is&lt;/strong&gt; an X-like action; therefore don't&lt;/em&gt;) becomes brittle when the &lt;em&gt;application&lt;/em&gt; of the rule has to traverse the middle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thread 2: input length itself hurts, even with perfect retrieval.&lt;/strong&gt; Du et al.'s &lt;a href="https://arxiv.org/abs/2510.05381" rel="noopener noreferrer"&gt;&lt;em&gt;Context Length Alone Hurts LLM Performance Despite Perfect Retrieval&lt;/em&gt;&lt;/a&gt; (EMNLP 2025) is the more uncomfortable one. The authors set up a benchmark where the model is given the relevant evidence, the relevant evidence is positioned right next to the question, and the irrelevant filler is masked out: every fair-fight condition you would design if you wanted to give long context every chance to succeed. Performance still drops by anywhere from 13.9% to 85% as input length grows. "Even when models can perfectly retrieve all relevant information, their performance still degrades substantially as input length increases." Their proposed mitigation is &lt;em&gt;recite before solve&lt;/em&gt;: have the model restate the relevant facts in a short scratchpad, then answer. Convert long context back to short context. On RULER, this gave up to +4 points for GPT-4o.&lt;/p&gt;

&lt;p&gt;If you put those two threads together, you get the prediction Cursor's operators keep finding: compaction does not just lose facts. It dissolves the &lt;em&gt;relationships&lt;/em&gt; between facts. The rule survives the summary as a fragment ("there are some safety rules"). The action survives as a directive ("fix the credential mismatch"). The arc that connects them, &lt;em&gt;and this rule binds this action&lt;/em&gt;, does not. The model's chain-of-thought picks up at the action end and never visits the rule end.&lt;/p&gt;
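
&lt;p&gt;One way to picture putting the arc back is a harness-side version of recite-before-solve: just before a destructive tool call, build a short prompt that sets the rules and the proposed action side by side. What follows is a minimal sketch under stated assumptions, not anything Cursor or Du et al. ship. The function name, the tool list (apart from &lt;code&gt;volumeDelete&lt;/code&gt;), and the prompt wording are all hypothetical.&lt;/p&gt;

```python
# Hypothetical harness helper. Only "volumeDelete" comes from the incident;
# the other tool names and the prompt wording are illustrative assumptions.
DESTRUCTIVE_TOOLS = {"volumeDelete", "dropDatabase", "rm"}


def recite_before_solve(rules, tool_name, tool_args):
    """Build a short scratchpad prompt that puts the rules next to the action.

    Returns None for tools that are not on the destructive list, so the
    extra model round-trip is only paid where it matters.
    """
    if tool_name not in DESTRUCTIVE_TOOLS:
        return None
    lines = ["Before calling the tool, restate each rule and say whether it applies:"]
    for number, rule in enumerate(rules, start=1):
        lines.append(f"{number}. {rule}")
    lines.append(f"Proposed action: {tool_name}({tool_args})")
    lines.append("Answer ALLOW or BLOCK, citing the rule number.")
    return "\n".join(lines)


prompt = recite_before_solve(
    ["Never delete production volumes."],
    "volumeDelete",
    {"volumeId": "vol-123"},  # made-up identifier
)
print(prompt)
```

&lt;p&gt;The design choice is that the rules are passed in verbatim from wherever the harness stores them, not recovered from the (possibly compacted) conversation, so the recitation cannot inherit whatever the summary lost.&lt;/p&gt;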

&lt;h2&gt;
  
  
  Anthropic agrees, on the record
&lt;/h2&gt;

&lt;p&gt;The thing that surprised me when I went looking is how on-the-record Anthropic is about all of this. Their &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;&lt;em&gt;Effective Context Engineering&lt;/em&gt;&lt;/a&gt; post (Sep 29, 2025) names the phenomenon directly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Studies on needle-in-a-haystack style benchmarking have uncovered the concept of context rot: as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases. While some models exhibit more gentle degradation than others, this characteristic emerges across all models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The same post tells you what to do about it: pursue "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome." Not "fill the window because the window is large." A passage in Anthropic's &lt;a href="https://platform.claude.com/docs/en/build-with-claude/context-windows" rel="noopener noreferrer"&gt;API documentation&lt;/a&gt; is even blunter: "more context isn't automatically better. As token count grows, accuracy and recall degrade, a phenomenon known as &lt;em&gt;context rot&lt;/em&gt;." Until March 2026, Anthropic priced this directly: requests over 200K tokens cost 2x input and 1.5x output, an implicit declaration that 200K was the reliability boundary they were comfortable selling.&lt;/p&gt;

&lt;p&gt;The cleanest external evidence for how steep the cliff is comes from a single reporter on &lt;a href="https://github.com/anthropics/claude-code/issues/35296" rel="noopener noreferrer"&gt;anthropics/claude-code issue #35296&lt;/a&gt;, opened March 17, 2026. The reporter ran 25+ transcripted sessions with Claude Opus 4.6 against a 20,000-record database and pinned down a behaviour profile by context-fill percentage:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context fill&lt;/th&gt;
&lt;th&gt;Behaviour observed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0–20%&lt;/td&gt;
&lt;td&gt;Reliable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20–40%&lt;/td&gt;
&lt;td&gt;Degrading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40–60%&lt;/td&gt;
&lt;td&gt;Unreliable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;60–80%&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;80–100%&lt;/td&gt;
&lt;td&gt;Irrecoverable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The same issue cites Anthropic's own MRCR v2 multi-needle benchmark: 93% accuracy at 256K, 76–78% at 1M. Roughly one in four multi-needle retrievals fails at the advertised maximum window. None of this is hidden. It is in Anthropic's docs, on Anthropic's blog, and in Anthropic's pricing history. It is just not in the chat window.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an honest UI for context loss would look like
&lt;/h2&gt;

&lt;p&gt;The thing that makes compaction unusually dangerous is that the user has no idea it has happened. The chat scrolls. Earlier turns are still visible above the fold. The model still answers in the same voice. Nothing in the interface signals that the context the model is &lt;em&gt;currently&lt;/em&gt; reasoning over is no longer the context the user thinks they share with it.&lt;/p&gt;

&lt;p&gt;Compare that to other places software handles state-loss. When a database connection drops and reconnects, the client logs it. When a process restarts, systemd records the restart in the journal. When git rebases your branch, it tells you which commits moved. Compaction, by contrast, is an invisible state transition. The agent's "memory" gets replaced with a paraphrase of the original, and the chat window does not draw a line.&lt;/p&gt;

&lt;p&gt;What I would want, as an operator, is something boringly straightforward: a banner before compaction fires that tells me the budget is about to be reset, an inline marker in the transcript at the point compaction occurred, and a one-click "diff" view that shows me what survived in the summary versus what was in the original. None of this is hard to build. You can prototype the budget half in a few dozen lines of Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ContextBudget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pre-compaction warning gate for an agent harness.

    Wrap your prompt-assembly with this and call .check() before each
    model call. It does not implement compaction itself; the point is
    to give the operator a chance to /summarize on their own terms,
    not to have the harness silently re-summarise mid-task.

    Call .mark_compacted() from your operator&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s /summarize path so
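    the next .check() can report when the last reset happened.

    Counts come from the tiktoken encoding for gpt-4o, so they only
    approximate token usage for non-OpenAI models such as Claude.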
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;WARN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;   &lt;span class="c1"&gt;# Cursor staff's recommended manual-/summarize point
&lt;/span&gt;    &lt;span class="n"&gt;HARD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;   &lt;span class="c1"&gt;# below the harness's own auto-trigger, with margin
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200_000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encoding_for_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_compaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;measure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mark_compacted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_compaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;measure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;used&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HARD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;CompactionRequired&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manual /summarize required before next call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WARN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;since&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_compaction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s ago&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_compaction&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;never&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[budget] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;used&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;); consider /summarize &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(last compaction: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CompactionRequired&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point of a wrapper like that is not the arithmetic. The arithmetic is the easy part. The point is that the operator gets to see the budget, the operator is the one who decides when to compact, and the moment compaction happens is logged into the transcript as an event the operator can scroll back to. That much would close the gap between "model's working context" and "what the user thinks they're chatting with." The rest of the honest-UI agenda (diffing the pre- and post-summary transcripts, marking which parts of system prompt survived the summary, surfacing the compaction event in the same way Slack surfaces a thread split) falls out of having an explicit compaction event in the first place.&lt;/p&gt;
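
&lt;p&gt;The explicit compaction event can be sketched just as cheaply. The snippet below is an assumption-laden illustration, not a description of any shipping harness: &lt;code&gt;compaction_event&lt;/code&gt; and its field names are hypothetical, and "survived" is approximated by a case-insensitive substring match. That test is deliberately crude, because a rule that was paraphrased in the summary is exactly the kind you want flagged for a human to look at.&lt;/p&gt;

```python
import time


def compaction_event(system_rules, summary, tokens_before, tokens_after):
    """Return a transcript marker recording what a compaction kept and dropped.

    A rule counts as "survived" only if its text appears verbatim
    (case-insensitively) in the summary. Paraphrased rules are reported
    as dropped so the operator can check them by hand.
    """
    text = summary.lower()
    survived = [rule for rule in system_rules if rule.lower() in text]
    dropped = [rule for rule in system_rules if rule.lower() not in text]
    return {
        "type": "compaction",
        "at": time.time(),
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "rules_survived": survived,
        "rules_dropped": dropped,
    }


# Made-up rules and summary, purely for illustration.
rules = ["never delete production volumes", "ask before running migrations"]
event = compaction_event(
    rules,
    "There are some safety rules. Ask before running migrations.",
    tokens_before=180_000,
    tokens_after=4_000,
)
print(event["rules_dropped"])  # ['never delete production volumes']
```

&lt;p&gt;Appending a record like this to the transcript is what gives the banner, the inline marker, and the diff view something to render.&lt;/p&gt;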

&lt;h2&gt;
  
  
  What this means for the rule-binding problem
&lt;/h2&gt;

&lt;p&gt;Bring this back to the failure mode in the PocketOS incident. The agent had safety rules in the system prompt. It had a destructive operation available. Some non-trivial number of tokens of intermediate work (file reads, shell output, grep results) accumulated between those two ends of the context. When compaction fired, the rules got summarised into "there are some safety rules." The action got summarised into "fix the credential mismatch by deleting the volume." The chain that should have stopped the action &lt;em&gt;because&lt;/em&gt; of the rule got summarised into nothing in particular.&lt;/p&gt;

&lt;p&gt;You can build a defence against that at three levels, and the punch line is that none of them is "use a smarter model." You can build it at the &lt;strong&gt;harness&lt;/strong&gt; level (recite-before-solve before destructive actions; restate the active rules into the model's working scratchpad immediately before tool use). You can build it at the &lt;strong&gt;API gateway&lt;/strong&gt; level (out-of-band confirmation for destructive mutations; scoped tokens that physically cannot reach production from a staging task). You can build it at the &lt;strong&gt;UI&lt;/strong&gt; level (visible compaction events; the operator chooses when, not the harness). Each level catches a different version of the same failure. The cheap version of all three together is more reliable than waiting for the next model release to "just handle longer contexts," because the next model release will have the same shape of failure at a different threshold. Context rot, in Anthropic's own framing, "emerges across all models."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Defence layer&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;th&gt;Concrete pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Harness&lt;/td&gt;
&lt;td&gt;Rule-binding lost during compaction&lt;/td&gt;
&lt;td&gt;Recite-before-solve: restate active safety rules into a fresh scratchpad before any destructive tool call (Du et al. 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API gateway&lt;/td&gt;
&lt;td&gt;Destructive mutation reaches the API at all&lt;/td&gt;
&lt;td&gt;Out-of-band confirmation; scoped tokens that physically cannot reach prod from a staging credential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI&lt;/td&gt;
&lt;td&gt;Operator can't see that context was compressed&lt;/td&gt;
&lt;td&gt;Pre-compaction banner; inline transcript marker; pre/post summary diff view&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model&lt;/td&gt;
&lt;td&gt;(Don't rely on this layer.)&lt;/td&gt;
&lt;td&gt;Better long-context attention is research, not a deployment plan&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
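
&lt;p&gt;The API-gateway row is the one that does not depend on the model's context at all, so it is worth seeing how little logic it needs. This is a hedged sketch, not Railway's API or any real gateway: &lt;code&gt;authorize&lt;/code&gt;, &lt;code&gt;ConfirmationRequired&lt;/code&gt;, and every mutation name other than &lt;code&gt;volumeDelete&lt;/code&gt; are hypothetical, and "scope" is reduced to a plain environment string.&lt;/p&gt;

```python
# Hypothetical gateway check. Only "volumeDelete" comes from the incident;
# "serviceDelete" and the scope model are illustrative assumptions.
DESTRUCTIVE_MUTATIONS = {"volumeDelete", "serviceDelete"}


class ConfirmationRequired(PermissionError):
    """Raised when a destructive mutation lacks out-of-band confirmation."""


def authorize(mutation, token_scope, target_env, confirmed_out_of_band=False):
    """Decide whether a mutation may be forwarded to the provider API."""
    # Scoped tokens: a staging credential can never reach production,
    # whatever the agent believes it is doing.
    if token_scope != target_env:
        raise PermissionError(
            f"token scoped to {token_scope!r} cannot touch {target_env!r}"
        )
    # Destructive mutations need a confirmation the agent cannot produce itself.
    if mutation in DESTRUCTIVE_MUTATIONS and not confirmed_out_of_band:
        raise ConfirmationRequired(f"{mutation} needs human confirmation")
    return True


try:
    authorize("volumeDelete", token_scope="staging", target_env="production")
except PermissionError as err:
    print(f"blocked: {err}")
```

&lt;p&gt;Neither check reads the prompt, the summary, or the rules. That is the point: this layer still holds when the rule-binding in the model's context has already been summarised away.&lt;/p&gt;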

&lt;h2&gt;
  
  
  What I'm taking from this
&lt;/h2&gt;

&lt;p&gt;The frame that helps me hold all of this in my head is to stop thinking of compaction as a bug. It's not a bug. Cursor's own blog post calls it a "lossy compression of the context." Anthropic's blog post says context rot is universal. Du et al.'s benchmark says even &lt;em&gt;perfect&lt;/em&gt; retrieval over a long context underperforms a short one. Three independent sources, three different framings, one underlying claim: the agent's working context is not the conversation you had with it. It's a derivative of that conversation, and the derivative is approximate, and the approximation is the part that fails.&lt;/p&gt;

&lt;p&gt;The prior incident I wrote about wasn't a hallucination event. It was a structural one: a long-running session where the link between the rule and the action got summarised away. The next one will have the same shape. The thing the industry will learn this year (late, the way it learned that retries need bounds and that connectors need monitoring) is that the chat window is a UI for the user, not for the model. The model has a different UI, and right now nobody is showing it to anyone.&lt;/p&gt;

</description>
      <category>cursor</category>
      <category>aiagents</category>
      <category>contextcompression</category>
      <category>llm</category>
    </item>
    <item>
      <title>What Zed Shipped in the First Ten Days After 1.0</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Mon, 15 Jun 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/what-zed-shipped-in-the-first-ten-days-after-10-30k8</link>
      <guid>https://dev.to/arthurpro/what-zed-shipped-in-the-first-ten-days-after-10-30k8</guid>
      <description>&lt;p&gt;&lt;a href="https://zed.dev/blog/zed-1-0" rel="noopener noreferrer"&gt;Ten days ago, on April 29, the Zed editor reached version 1.0&lt;/a&gt;. The team had been working toward that milestone for five years. The piece I wrote that day, &lt;a href="https://pickles.news/posts/zed-is-1-0/" rel="noopener noreferrer"&gt;&lt;em&gt;Zed Is 1.0 — and the Electron Era Just Ended&lt;/em&gt;&lt;/a&gt;, was about why the foundation of the editor was the news: a native, GPU-accelerated, Rust-built code editor with no Chromium underneath, ready for the developers who passed on it during the long preview.&lt;/p&gt;

&lt;p&gt;This piece is about what happened next.&lt;/p&gt;

&lt;p&gt;In the ten days from April 29 through May 8, Zed shipped four stable releases after 1.0, posted four blog entries, launched a paid Business plan, opened a public conversation about why the team is investing in AI at all, and released a new edit-prediction model that uses about a third as many tokens as the one it replaced. None of those things were on the launch slide for 1.0. All of them landed in the time it takes most software teams to argue about a sprint goal.&lt;/p&gt;

&lt;p&gt;The cadence is the story. The features are how you read it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ten days, six shipping events
&lt;/h2&gt;

&lt;p&gt;Here is the calendar, in the order things actually happened. The dates are pulled from &lt;a href="https://zed.dev/releases/stable" rel="noopener noreferrer"&gt;Zed's own stable-release page&lt;/a&gt; and &lt;a href="https://zed.dev/blog" rel="noopener noreferrer"&gt;the team's blog&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Release / event&lt;/th&gt;
&lt;th&gt;What shipped&lt;/th&gt;
&lt;th&gt;Why a normal user notices&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Apr 29&lt;/td&gt;
&lt;td&gt;1.0.0&lt;/td&gt;
&lt;td&gt;Five years of work declared stable&lt;/td&gt;
&lt;td&gt;Foundation is no longer marked "preview"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 4&lt;/td&gt;
&lt;td&gt;1.0.1&lt;/td&gt;
&lt;td&gt;Agent edit-apply fix&lt;/td&gt;
&lt;td&gt;Agentic code edits stop silently failing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 5&lt;/td&gt;
&lt;td&gt;Blog: &lt;em&gt;We're Not Building AI Features for the Money&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Philosophy post on why AI is in Zed&lt;/td&gt;
&lt;td&gt;Counter-narrative to vendor-AI hype&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 6&lt;/td&gt;
&lt;td&gt;1.1.5 + Zed for Business&lt;/td&gt;
&lt;td&gt;Panel layout switcher (classic / agentic), git graph view, split diff in agent and file diff panels, LSP code lens, Helix amp jump navigation, DeepSeek V4-Pro/Flash and OpenCode Go provider support — &lt;em&gt;plus&lt;/em&gt; a $30-per-seat Business plan with org-wide AI controls&lt;/td&gt;
&lt;td&gt;The editor for teams now exists, and the headline interaction surface changed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 6&lt;/td&gt;
&lt;td&gt;1.1.6&lt;/td&gt;
&lt;td&gt;Windows ACP-launch fix, Linux inotify event-queue overflow fix&lt;/td&gt;
&lt;td&gt;Zed actually works on Windows and on busy Linux trees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 8&lt;/td&gt;
&lt;td&gt;1.1.7 + Zeta2.1&lt;/td&gt;
&lt;td&gt;zeta2 prompt-format fix, filesystem-error CPU regression fix, Helix-motion panic fix, markdown-preview reload — &lt;em&gt;plus&lt;/em&gt; a new edit-prediction model with 67% fewer output tokens and 28% lower median latency&lt;/td&gt;
&lt;td&gt;Suggestions feel snappier and the editor stops eating CPU on a broken symlink&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A quick reading of that table is enough to see the pattern. May 6 is the loud day: a feature release, a Business plan, and a same-day patch chasing the bugs the feature release surfaced on Windows and Linux. May 8 is the quieter substantive day: a small bugfix release in the foreground and a new AI model in the background, shipped together because the model and the editor have to land at the same time for either to work.&lt;/p&gt;

&lt;p&gt;There are also five version numbers that never reached the stable channel. Zed went from 1.0.1 to 1.1.5 without 1.1.0 through 1.1.4 ever being promoted. Those numbers existed; they were preview-channel cuts, real builds with real changes, that the team chose not to push to every user. The decision to skip them is its own piece of information about the cadence: Zed runs a fast preview channel and a careful stable channel, and lets the users who like the cliff edge ride the preview while the rest get a smaller number of stable promotions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The May 6 story: the day Zed became a business
&lt;/h2&gt;

&lt;p&gt;The most significant thing that happened in the ten-day window did not have a version number. On May 6, &lt;a href="https://zed.dev/blog/zed-for-business" rel="noopener noreferrer"&gt;Zed Industries announced Zed for Business&lt;/a&gt; — a $30-per-seat-per-month plan aimed at teams that want central control over the AI defaults their engineers can flip.&lt;/p&gt;

&lt;p&gt;The shape of the plan is worth reading carefully. Companies can &lt;a href="https://zed.dev/blog/zed-for-business" rel="noopener noreferrer"&gt;bring their own API keys from Anthropic, OpenAI, Google, or AWS without an additional Zed markup, or use Zed-hosted AI billed at provider cost plus 10%&lt;/a&gt;. Prompt sharing and edit-prediction training are off by default at the organisation level, and individual engineers cannot override that setting. Administrators can disable Zed-hosted models, edit predictions, and collaboration features for the whole organisation, and set spend limits on tokens.&lt;/p&gt;

&lt;p&gt;That last detail is the one that makes the Business plan more than a SKU. The privacy guarantees normal users have always had on Zed — no prompt storage by default, no training on your code by default — are now enforceable as policy. A security team can lock them on. The individual engineer cannot opt back into "share my prompts" by accident on a Tuesday afternoon. That is not the same product as "Zed with AI features turned on" — it is a meaningfully different artefact aimed at a different buyer.&lt;/p&gt;

&lt;p&gt;The same day, &lt;a href="https://zed.dev/releases/stable/1.1.5" rel="noopener noreferrer"&gt;release 1.1.5 added the panel layout switcher between a &lt;em&gt;classic&lt;/em&gt; IDE arrangement and an &lt;em&gt;agentic&lt;/em&gt; one&lt;/a&gt;. The two layouts are both first-class. You pick which one matches the work you are doing in the moment. A debugger session in classic; a multi-agent refactor in agentic. The editor stops insisting that there is one right way to lay out the screen.&lt;/p&gt;

&lt;p&gt;Then, while the new layout was rolling out, Windows users on certain configurations could not launch their Agent Client Protocol agents at all, and Linux users on busy trees were hitting inotify event-queue overflows. &lt;a href="https://zed.dev/releases/stable/1.1.6" rel="noopener noreferrer"&gt;Version 1.1.6 shipped the same day&lt;/a&gt; to fix both. The polite reading is "they caught the Windows and Linux bugs in their dogfooding within hours and pushed a fix that afternoon." The honest reading is the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  The May 8 story: a smaller, faster brain
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://zed.dev/edit-prediction" rel="noopener noreferrer"&gt;Zeta is the model that powers Zed's edit prediction&lt;/a&gt; — the inline ghost-text suggestions that appear as you type, that you accept with Tab. It is not the agent. It is the smaller, lower-latency thing that runs continuously in the background, trying to keep up with where your cursor is going.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zed.dev/blog/zeta2-1" rel="noopener noreferrer"&gt;On May 8, Zed posted Zeta2.1&lt;/a&gt;. The numbers are the headline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Output tokens dropped from about 270 to about 90 — a 67% reduction.&lt;/li&gt;
&lt;li&gt;Median latency dropped from 189 ms to 136 ms, about 28% lower.&lt;/li&gt;
&lt;li&gt;Acceptance rate improved by 0.51%; explicit-rejection rate fell by 4.10%.&lt;/li&gt;
&lt;li&gt;Infrastructure footprint dropped by roughly 30% — fewer servers carrying the same traffic.&lt;/li&gt;
&lt;/ul&gt;
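
&lt;p&gt;The first two percentages follow directly from the raw figures in the list. A quick arithmetic check, sketched in Python; the helper name &lt;code&gt;pct_drop&lt;/code&gt; is made up for this example, and the inputs are the rounded numbers quoted above rather than Zed's underlying measurements:&lt;/p&gt;

```python
def pct_drop(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100

# Figures as quoted in the list above (already rounded by the source).
token_drop = pct_drop(270, 90)      # output tokens per prediction
latency_drop = pct_drop(189, 136)   # median latency in milliseconds

print(f"output tokens: {token_drop:.1f}% fewer")
print(f"median latency: {latency_drop:.1f}% lower")
```

&lt;p&gt;Both results round to the quoted 67% and 28%. The acceptance and rejection deltas cannot be re-derived this way, because the list gives only the changes and not the baseline rates.&lt;/p&gt;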

&lt;p&gt;The technical change is a new prompt format the team calls &lt;em&gt;Multi-Region&lt;/em&gt;. The previous version had the model output a large region around your cursor with the model's edits applied; the new one only outputs the slice of code that actually changed. The model has the same amount of context going in. It says less coming out. Less to generate, less to send over the wire, less to render on screen.&lt;/p&gt;
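
&lt;p&gt;To make the difference between the two output styles concrete, here is a toy sketch. It does not reproduce Zed's actual Multi-Region prompt format, which this article does not spell out; it uses Python's &lt;code&gt;difflib&lt;/code&gt; to compare re-emitting a whole region against emitting only the changed lines, with character counts as a rough stand-in for output tokens. The function names and the sample snippet are illustrative assumptions.&lt;/p&gt;

```python
import difflib

def changed_slices(before, after):
    """Return (start, end, replacement_lines) for each edited span of `before`."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    return [
        (i1, i2, after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

def apply_slices(before, slices):
    """Rebuild the edited region from the original lines plus the slices."""
    out = list(before)
    # Apply from the bottom up so earlier indices stay valid.
    for start, end, replacement in sorted(slices, reverse=True):
        out[start:end] = replacement
    return out

# Hypothetical region around the cursor, before and after one edit.
region = [
    "def total(items):",
    "    result = 0",
    "    for item in items:",
    "        result += item.price",
    "    return result",
]
edited = list(region)
edited[3] = "        result += item.price * item.quantity"

slices = changed_slices(region, edited)
full_output = sum(len(line) for line in edited)
slice_output = sum(len(line) for _, _, lines in slices for line in lines)

print(slices)
print(f"full region: {full_output} chars, changed slice: {slice_output} chars")
```

&lt;p&gt;In this example the region is five lines and only one of them changes, so the slice output is a fraction of the full-region output, and &lt;code&gt;apply_slices&lt;/code&gt; can rebuild the edited region from the original lines plus the slice.&lt;/p&gt;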

&lt;p&gt;For someone using the editor, the practical consequence is that suggestions feel slightly snappier, and you end up accepting them a little more often and explicitly rejecting them less. The deeper consequence is in &lt;a href="https://zed.dev/blog/zeta2-1" rel="noopener noreferrer"&gt;the model's open-weight release on Hugging Face&lt;/a&gt;, trained on opt-in open-source data. The model that ships in your editor is the same model anyone can download, inspect, and run independently. That is an unusual posture for a feature in a 2026 IDE. It is also the posture the Zed team has been talking about for several years.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zed.dev/releases/stable/1.1.7" rel="noopener noreferrer"&gt;The same day, version 1.1.7 closed out the small bugs in the foreground&lt;/a&gt; — including a fix for local Zeta2 edit predictions, which had been using the wrong prompt format.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Zed Guild
&lt;/h2&gt;

&lt;p&gt;The piece that shows up in most release notes only as credits at the bottom, and that I think matters more than most of what gets the headline framing, is the &lt;a href="https://zed.dev/community/zed-guild" rel="noopener noreferrer"&gt;Zed Guild&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Guild is a twelve-week cohort program for outside contributors. Selected applicants pair with a Zed engineer for the duration of the cohort and ship features into the actual repository. The first cohort has finished. The page that describes the program is, by 2026 standards, almost embarrassingly low on marketing copy: a paragraph of program description, a wall of GitHub avatars from cohort members, and a closed application window.&lt;/p&gt;

&lt;p&gt;The reason this matters for an article about ten days of shipping is that ten days of shipping at this density is not something an in-house team produces by itself. The 1.1.5 release notes credit a long list of community changes alongside the marquee features. The Guild is one of the legible mechanisms by which that list gets longer. It is also a quietly important answer to the question every editor that wants to outlast its founders eventually has to answer: &lt;em&gt;who else cares about this codebase enough to keep it healthy when the founders eventually move on?&lt;/em&gt; Atom, the editor that taught the Zed founders what they did not want to build again, was killed by its corporate owner in 2022. The Guild is a slow, careful bet on building a constituency that does not depend on a single corporate owner staying willing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the cadence is the actual story
&lt;/h2&gt;

&lt;p&gt;The reason the calendar matters is that the dominant editor-category competition in 2026 is between three different theories of what an editor is for, and the theories ship on three different clocks.&lt;/p&gt;

&lt;p&gt;VS Code wins on inertia. The integrated bet is that once you and your team have invested in extensions, settings, and muscle memory, you do not switch even when something better appears. Microsoft can ship at any pace because the customer's switching cost is already doing the work.&lt;/p&gt;

&lt;p&gt;JetBrains wins on completeness. The integrated bet is that once you need refactoring, database tooling, and language intelligence at IntelliJ depth, you accept the long startup time and the heavy memory footprint, because nothing else covers the same ground. JetBrains can ship a major IDE on a multi-month rhythm because the alternative is not catching up.&lt;/p&gt;

&lt;p&gt;Zed is trying to win on momentum. The integrated bet is that an editor that meaningfully improves every two weeks pulls users toward it the way a static editor cannot pull them, because the gap between &lt;em&gt;what your editor was a month ago&lt;/em&gt; and &lt;em&gt;what it is today&lt;/em&gt; is large enough to feel. That bet only works if the cadence is sustainable. The ten-day window between 1.0 and 1.1.7 is the first public proof that the cadence is real on the stable channel, not just in preview, and not just for one release. Five stable releases, four blog posts, a Business plan, a new model, and an open community program — that is what the bet looks like when it lands.&lt;/p&gt;

&lt;p&gt;It is too early to call the bet won. Six months of shipping at this density would be the harder test. Three months of shipping at this density while the team stops running on launch adrenaline would be the test after that. What we have today is ten days of evidence that the cadence is being delivered.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do with this
&lt;/h2&gt;

&lt;p&gt;If you have been waiting on the original &lt;em&gt;should I switch&lt;/em&gt; question, my answer is the same as it was ten days ago: do not switch your daily-driver editor in the middle of a project, but it is now a reasonable thing to put on the side project you start next month.&lt;/p&gt;

&lt;p&gt;The thing that has shifted is the watch interval. The right amount of attention to give Zed in May 2026 is roughly &lt;em&gt;check the changelog once a fortnight and see whether anything you would actually use has landed.&lt;/em&gt; That is not how I have ever paid attention to an editor in twenty years of doing this work. It is the right amount of attention to pay to this one.&lt;/p&gt;




&lt;p&gt;Ten days. Five stable releases. Four blog posts. A Business plan. A smaller, faster AI model. A community program that finished its first cohort. None of it was on the launch slide.&lt;/p&gt;

&lt;p&gt;The launch was the foundation. The foundation is now visibly carrying weight. The next month or two will tell us whether the people on top of it can keep stacking, or whether the rhythm slows and the pattern shifts to the one we have all seen before. For now, the rhythm has not slowed. That alone is worth saying out loud.&lt;/p&gt;

</description>
      <category>editors</category>
      <category>ide</category>
      <category>zed</category>
      <category>aicoding</category>
    </item>
    <item>
      <title>DigitalOcean vs Vultr: The AWS Alternatives Small Businesses Actually Need</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Fri, 12 Jun 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/digitalocean-vs-vultr-the-aws-alternatives-small-businesses-actually-need-67</link>
      <guid>https://dev.to/arthurpro/digitalocean-vs-vultr-the-aws-alternatives-small-businesses-actually-need-67</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A quick note on the links below.&lt;/strong&gt; The DigitalOcean and Vultr links in this article are referral links. If you sign up via them, you get a free credit on your new account (currently $200 over 60 days for DigitalOcean and up to $300 for Vultr) and the author of this article gets a small referral credit too, at no extra cost to you. AWS does not run an equivalent referral program, so the AWS links are normal links. The review below is the author's own evaluation; the credits do not change the recommendations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you have ever spent a workday watching your website refuse to load, you are not alone. In a &lt;a href="https://pickles.news/posts/the-thermal-event-that-took-half-the-internet-with-it/" rel="noopener noreferrer"&gt;recent outage&lt;/a&gt;, a single building in Northern Virginia hosting one of Amazon's availability zones (the cloud-industry term for &lt;em&gt;one campus's worth of servers in one region&lt;/em&gt;) got too hot. The hardware shut itself down. AWS calls this a &lt;em&gt;thermal event.&lt;/em&gt; Customers around the world have other names for it.&lt;/p&gt;

&lt;p&gt;Big enterprises ride out outages like this. They have multi-region setups, dedicated SRE teams, and SLA credits that will refund a small fraction of their monthly bill. Small and mid-size businesses do not. They lose a day of revenue, scramble to reassure customers, and then read a post-mortem in a few weeks that explains what went wrong in language that does not help them recover the lost revenue.&lt;/p&gt;

&lt;p&gt;The cloud was supposed to make small businesses look big. After each new outage, it is fair to ask: is AWS actually the right cloud for small businesses at all? Two providers worth a serious look, &lt;strong&gt;DigitalOcean&lt;/strong&gt; and &lt;strong&gt;Vultr&lt;/strong&gt;, are simpler, cheaper at the entry level, and built around use cases that more closely match what a small business actually needs. Here is what each one does, where AWS is still the right answer, and how to decide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AWS hits small businesses harder than big ones
&lt;/h2&gt;

&lt;p&gt;When a giant company has an AWS outage, three teams kick into gear. There is the engineering team that fails workloads over to a backup region. There is the customer-success team that updates the status page. And there is the finance team that calculates the SLA-credit recovery against the contract.&lt;/p&gt;

&lt;p&gt;Now imagine a small business. There is the founder, who is the engineering team, the customer-success team, and the finance team all at once. There is no backup region, because setting one up costs money the business does not have to spend on standby capacity. The SLA credit, if it ever lands, is a refund of the affected service's monthly bill, which for most small businesses is well under a hundred dollars. The actual loss is the day's missed orders, the customer-trust damage, and the hours the founder spent updating people on Slack instead of running the business.&lt;/p&gt;

&lt;p&gt;This is not a complaint about AWS. AWS is built for scale. The reason it has hundreds of services, dozens of EC2 instance types, and an entire profession built around managing IAM permissions is that big customers need all of those things. The mismatch is on the small-business side. If you do not need that breadth of services and you cannot afford to architect for multi-region failover, you are paying for a fire truck to deliver groceries.&lt;/p&gt;

&lt;h2&gt;
  
  
  When AWS is genuinely overkill
&lt;/h2&gt;

&lt;p&gt;If you run any of the following, you almost certainly do not need full AWS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A small website with predictable traffic.&lt;/li&gt;
&lt;li&gt;A SaaS product with a few thousand users.&lt;/li&gt;
&lt;li&gt;An e-commerce store with a normal product catalog.&lt;/li&gt;
&lt;li&gt;A WordPress site, a simple Rails or Django app, a static landing page with a contact form.&lt;/li&gt;
&lt;li&gt;An internal tool used by a team of fewer than fifty people.&lt;/li&gt;
&lt;li&gt;A side project, a personal blog, a portfolio.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In each of these cases, the AWS console is mostly an obstacle between you and the thing you are trying to do. The pricing is harder to predict. The default settings are not optimized for your workload. The documentation is excellent in places and bewildering in others. And failure modes like a thermal event in one availability zone propagate to your business in ways you have no architectural levers to absorb.&lt;/p&gt;

&lt;p&gt;The right cloud for these workloads is one that is simpler, has predictable monthly pricing, and treats &lt;em&gt;getting started&lt;/em&gt; as a first-class problem rather than as something for the customer to figure out.&lt;/p&gt;

&lt;h2&gt;
  
  
  DigitalOcean
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://m.do.co/c/f3df14647b88" rel="noopener noreferrer"&gt;DigitalOcean&lt;/a&gt; was founded in 2012 in New York City by brothers Ben and Moisey Uretsky together with Mitch Wainer, Jeff Carr, and Alec Hartman. The company went public on the New York Stock Exchange in March 2021, raising $775 million at $47 per share. It is now headquartered in Broomfield, Colorado, with around 14 datacenters spread across 11 geographic regions.&lt;/p&gt;

&lt;p&gt;DigitalOcean's product is famously approachable. The cheapest droplet (DigitalOcean's name for a virtual server) is $4 per month and includes 512 MiB of RAM, 1 vCPU, 10 GiB of SSD storage, and 500 GiB of monthly outbound transfer. That price is flat. Per-second billing has been the default since the start of 2026, so you pay only for the time the droplet exists on your account; note that a powered-off droplet keeps billing until you destroy it. New accounts that sign up &lt;a href="https://m.do.co/c/f3df14647b88" rel="noopener noreferrer"&gt;via this referral link&lt;/a&gt; currently get $200 of free credit usable over the first 60 days, which is enough to run several mid-tier droplets for the entire trial window without paying anything.&lt;/p&gt;

&lt;p&gt;What DigitalOcean is good at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The simplest path from idea to running server.&lt;/strong&gt; You sign up, click &lt;em&gt;Create Droplet,&lt;/em&gt; pick a region, and about a minute later you have a Linux box on the internet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictable monthly bills.&lt;/strong&gt; Most small-business workloads stay on a flat plan; surprise charges are rare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outstanding documentation and tutorials.&lt;/strong&gt; DigitalOcean's how-to library is one of the best free resources for self-taught developers on the internet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What DigitalOcean is less good at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Geographic reach.&lt;/strong&gt; 11 regions is fine for most use cases but limits options for global low-latency apps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced services.&lt;/strong&gt; If you need managed Kubernetes with very specific networking, GPU instances, or specialized compliance frameworks, you will run into ceilings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DigitalOcean is the right answer if you want simplicity, predictable pricing, and a learning environment that holds your hand through the parts that AWS assumes you already know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vultr
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.vultr.com/?ref=6973069" rel="noopener noreferrer"&gt;Vultr&lt;/a&gt; is a privately held American cloud provider that, by its own count in multiple recent press releases, has &lt;strong&gt;33 global datacenter regions&lt;/strong&gt; spanning six continents. Vultr's marketing claim is that its network reaches 90% of the world's population within 2 to 40 milliseconds. Whether or not that exact figure holds for your specific app, the practical implication is real: if your customers are spread across many countries, Vultr probably has a datacenter closer to most of them than DigitalOcean does.&lt;/p&gt;

&lt;p&gt;Vultr's pricing is aggressive at the entry level. The cheapest cloud-compute instance is $2.50 per month for an IPv6-only configuration with 1 vCPU, 0.5 GB of RAM, and 10 GB of storage. Hourly rates start at $0.004 per hour. Vultr also offers bare-metal servers from $120 per month and a substantial range of GPU instances including NVIDIA H100, A100, and L40S models, useful if your small business is doing AI work and does not want to take out a multi-year reserved-instance commitment. New accounts that sign up &lt;a href="https://www.vultr.com/?ref=6973069" rel="noopener noreferrer"&gt;via this referral link&lt;/a&gt; currently get up to $300 of free credit, which is generous enough to run a meaningful pilot before any money leaves your card.&lt;/p&gt;

&lt;p&gt;What Vultr is good at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Geographic distribution.&lt;/strong&gt; 33 regions is genuinely a lot. It is more than DigitalOcean and more than AWS Lightsail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggressive pricing at the entry level.&lt;/strong&gt; $2.50 a month is a useful price point for very small workloads or for staging environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bare metal and GPU options.&lt;/strong&gt; If you eventually outgrow virtual servers, Vultr has the next tier without making you switch providers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What Vultr is less good at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Documentation and tutorials are not as deep as DigitalOcean's.&lt;/strong&gt; Vultr is a perfectly fine product for an experienced developer; for a first-time cloud user, DigitalOcean's docs are a softer landing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brand recognition.&lt;/strong&gt; Vultr is well-known in hosting circles but less familiar to customers, partners, and procurement teams. This is rarely a deal-breaker but worth knowing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vultr is the right answer if you have customers in many regions, need bare metal or GPUs, or are comfortable enough with cloud servers that you do not need the tutorial layer DigitalOcean provides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Side-by-side
&lt;/h2&gt;

&lt;p&gt;The fairest AWS-side comparator for small businesses is not full AWS but &lt;strong&gt;AWS Lightsail&lt;/strong&gt;, Amazon's own simplified-pricing offering aimed at the same SMB market. Here is how the three line up:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;DigitalOcean&lt;/th&gt;
&lt;th&gt;Vultr&lt;/th&gt;
&lt;th&gt;AWS Lightsail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cheapest plan&lt;/td&gt;
&lt;td&gt;$4 / mo (1 vCPU, 512 MiB RAM, 10 GiB SSD, 500 GiB transfer)&lt;/td&gt;
&lt;td&gt;$2.50 / mo (1 vCPU, 0.5 GB RAM, 10 GB storage, IPv6 only)&lt;/td&gt;
&lt;td&gt;$3.50 / mo IPv6-only or $5 / mo with IPv4 (2 vCPUs, 512 MB RAM, 20 GB SSD, 1 TB transfer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hourly billing&lt;/td&gt;
&lt;td&gt;$0.00595 / hr starting; per-second billing since 2026&lt;/td&gt;
&lt;td&gt;$0.004 / hr starting&lt;/td&gt;
&lt;td&gt;Bundled monthly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Datacenter regions&lt;/td&gt;
&lt;td&gt;~14 datacenters across 11 regions&lt;/td&gt;
&lt;td&gt;33 global regions&lt;/td&gt;
&lt;td&gt;16 (out of AWS's 37 total regions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free tier / credit&lt;/td&gt;
&lt;td&gt;$200 credit over 60 days via referral link (terms vary)&lt;/td&gt;
&lt;td&gt;Up to $300 credit via referral link (terms vary by region)&lt;/td&gt;
&lt;td&gt;3 months free on select bundles for new accounts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing predictability&lt;/td&gt;
&lt;td&gt;Flat monthly + per-second hourly&lt;/td&gt;
&lt;td&gt;Flat monthly + hourly&lt;/td&gt;
&lt;td&gt;Flat bundled monthly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup friction&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Moderate (requires AWS account)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documentation quality&lt;/td&gt;
&lt;td&gt;Excellent (industry-best free tutorials)&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Good (but inherits AWS sprawl)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bare metal / GPU options&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes (extensive)&lt;/td&gt;
&lt;td&gt;No (Lightsail is VM-only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best SMB use case&lt;/td&gt;
&lt;td&gt;Beginners; mid-stage SaaS; predictable workloads&lt;/td&gt;
&lt;td&gt;Latency-sensitive global apps; bare-metal needs&lt;/td&gt;
&lt;td&gt;Teams already on AWS who want simpler pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When AWS is still the right answer
&lt;/h2&gt;

&lt;p&gt;There are real cases where AWS (full AWS, not Lightsail) is the correct choice for a small business. If your workload involves any of the following, plan to stay on AWS or evaluate carefully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compliance-heavy regulated workloads.&lt;/strong&gt; HIPAA, PCI-DSS-heavy payment processing, FedRAMP requirements. AWS has the broadest set of compliance certifications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Very large data and analytics.&lt;/strong&gt; If you are building on top of S3, Redshift, Athena, or running custom ML pipelines at scale, AWS is hard to beat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep service integration.&lt;/strong&gt; If your business relies on specific AWS services with no equivalent elsewhere, like Step Functions, EventBridge, Cognito, or large-scale Lambda fan-out, switching is more disruptive than it is worth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You already operate AWS at scale.&lt;/strong&gt; If you have a team that knows AWS and existing infrastructure-as-code, the migration cost is rarely worth the per-month savings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everyone else, and &lt;em&gt;most small businesses are everyone else,&lt;/em&gt; the alternative is real and worth trying.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to actually try them
&lt;/h2&gt;

&lt;p&gt;Pick one provider. Sign up for an account. Spin up a small instance at a realistic SMB-workload size (a 2 vCPU / 4 GB box is the size most real apps actually need, well below the absolute cheapest plans the providers advertise). Deploy a test workload (a copy of your existing site, a development environment, or a side project) and run it for a week. Time the page loads. Check the support response time. Look at the bill at the end of the week and compare it to what you would have paid AWS for the same workload.&lt;/p&gt;

&lt;p&gt;If you sign up via the referral links below, both DigitalOcean and Vultr add a free credit to your new account, which means the trial week itself can be free.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Entry price (2 vCPU / 4 GB tier)&lt;/th&gt;
&lt;th&gt;Free credit on signup&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/free/" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS EC2&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;~$0.034/hr, or roughly $25/mo at 730 hours (&lt;code&gt;t4g.medium&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;No referral credit (AWS Free Tier exists, but &lt;code&gt;t2/t3.micro&lt;/code&gt; is too small for realistic apps)&lt;/td&gt;
&lt;td&gt;Existing AWS users; enterprise; IAM-based authentication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://m.do.co/c/f3df14647b88" rel="noopener noreferrer"&gt;&lt;strong&gt;DigitalOcean&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;~$24/mo&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$200 free over 60 days&lt;/strong&gt; — &lt;a href="https://m.do.co/c/f3df14647b88" rel="noopener noreferrer"&gt;sign up via this link&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Simplest setup; predictable flat pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.vultr.com/?ref=6973069" rel="noopener noreferrer"&gt;&lt;strong&gt;Vultr&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;~$20/mo&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Up to $300 free&lt;/strong&gt; — &lt;a href="https://www.vultr.com/?ref=6973069" rel="noopener noreferrer"&gt;sign up via this link&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Wide region selection; competitive pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Prices are approximate and vary by region. Free-credit terms are set by each provider and change occasionally; check the signup page for current details.&lt;/p&gt;
&lt;/blockquote&gt;
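
&lt;p&gt;The table mixes hourly and monthly prices, so a small conversion helps when comparing them. The sketch below assumes a 730-hour month and uses only the approximate figures from the table; it ignores storage and data transfer, which EC2 bills separately from the instance price.&lt;/p&gt;

```python
HOURS_PER_MONTH = 730   # assumption: average month, 8760 hours / 12
HOURS_PER_WEEK = 24 * 7

# Approximate prices from the table above; check current pricing pages.
hourly_rates = {
    "AWS EC2 t4g.medium": 0.034,             # quoted per hour
    "DigitalOcean": 24 / HOURS_PER_MONTH,    # quoted as ~$24/mo
    "Vultr": 20 / HOURS_PER_MONTH,           # quoted as ~$20/mo
}

for name, rate in hourly_rates.items():
    monthly = rate * HOURS_PER_MONTH
    trial_week = rate * HOURS_PER_WEEK
    print(f"{name:20s} ${monthly:6.2f}/mo   ${trial_week:5.2f} for a one-week trial")
```

&lt;p&gt;On those assumptions the instance prices land within about five dollars a month of each other, and a one-week trial costs only a few dollars on any of the three before credits.&lt;/p&gt;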

&lt;p&gt;The credit only lands on your account if you sign up via a referral link. Going through the providers' main marketing pages typically does not add the credit, so it is worth using the links above the first time you create an account.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cloud should fit your size
&lt;/h2&gt;

&lt;p&gt;The cloud was supposed to make small businesses look big. The current reality, after each new outage and the many before it, is that it has often made small businesses dependent on infrastructure they neither fully understand nor have any architectural say in. AWS is a tremendous product for the customers it was built for. Most small and mid-size businesses are not those customers.&lt;/p&gt;

&lt;p&gt;DigitalOcean and Vultr are not the answer to every cloud problem. They are, for the workloads that actually live at the small-business end of the market, a much closer fit. The cloud should fit your size. Pick one that does.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>aws</category>
      <category>digitalocean</category>
      <category>vultr</category>
    </item>
    <item>
      <title>The forgotten AI critters of the 1990s rediscovered most of what 2026 calls agents</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Fri, 12 Jun 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/the-forgotten-ai-critters-of-the-1990s-rediscovered-most-of-what-2026-calls-agents-14gl</link>
      <guid>https://dev.to/arthurpro/the-forgotten-ai-critters-of-the-1990s-rediscovered-most-of-what-2026-calls-agents-14gl</guid>
      <description>&lt;p&gt;In 1996, on a CRT monitor running Windows 3.1, you could watch a small fuzzy creature with floppy ears wander into a patch of poisonous berries, eat one, vomit, and remember not to eat that variety again. The creature was called a &lt;em&gt;norn&lt;/em&gt;, the world it inhabited was called Albia, and the game was &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Steve_Grand_(roboticist)" rel="noopener noreferrer"&gt;Creatures&lt;/a&gt;&lt;/em&gt;, designed by Steve Grand at Cyberlife Technology in Cambridge. By any contemporary metric the creature was an &lt;em&gt;agent&lt;/em&gt; — it perceived, planned, acted, learned from outcomes, signalled to other agents, mated, raised children, and over generations the surviving population's behaviour drifted in directions Grand and his team had not specified.&lt;/p&gt;

&lt;p&gt;It was a boxed retail product, sold through the same shelves as the contemporary Tamagotchi launch, aimed at children.&lt;/p&gt;

&lt;p&gt;I want to spend a few thousand words on this and on three of its near contemporaries — Tom Ray's &lt;em&gt;Tierra&lt;/em&gt; (1991), Karl Sims's evolved virtual creatures (1994), and the &lt;em&gt;Avida&lt;/em&gt; platform that ran underneath the 2003 &lt;em&gt;Nature&lt;/em&gt; paper "The Evolutionary Origin of Complex Features" — because almost everything our industry now calls "agents" was prefigured in this 1991-2003 window, in vocabulary so unfashionable now that the prefiguring is invisible. The 2026 stack is rediscovering, one production incident at a time, what the artificial-life community of the 1990s knew the first time around.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually inside a norn
&lt;/h2&gt;

&lt;p&gt;The retrospective on &lt;em&gt;Creatures&lt;/em&gt; I'm reading was the prompt to dig into the design notes again. The norn's brain was not a single network. It was &lt;em&gt;nine&lt;/em&gt; networks, called &lt;em&gt;lobes&lt;/em&gt;, each named for the functional role its designers thought a corresponding piece of mammalian cortex played. The canonical nine — preserved in the Creatures community wiki and in the openc2e source — are &lt;em&gt;Perception&lt;/em&gt;, &lt;em&gt;Drive&lt;/em&gt;, &lt;em&gt;Source&lt;/em&gt;, &lt;em&gt;Verb&lt;/em&gt;, &lt;em&gt;Noun&lt;/em&gt;, &lt;em&gt;General Sense&lt;/em&gt;, &lt;em&gt;Decision&lt;/em&gt;, &lt;em&gt;Attention&lt;/em&gt;, and &lt;em&gt;Concept&lt;/em&gt;. Perception encoded sensory input. Drive tracked the norn's biological needs (hunger, sleepiness, fear). Source kept track of where stimuli were coming from. Verb and Noun were the candidate-action and candidate-object banks the norn could draw from. General Sense handled concepts not tied to a particular stimulus. The Decision lobe chose a (verb, noun) pair given the current Drive and Source inputs. The Attention lobe selected one stimulus to attend to. The Concept lobe learned associations Hebbian-style across the network. &lt;em&gt;Emotions&lt;/em&gt;, separately, were modelled as scalar concentrations of simulated neurochemicals that biased the Decision lobe's weighting — a hungry norn would be more aggressive in competition for food; a frightened one would weight escape actions higher.&lt;/p&gt;

&lt;p&gt;The architecture was documented in Grand's design notes and later in his 2001 book &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Steve_Grand_(roboticist)" rel="noopener noreferrer"&gt;Creation: Life and How to Make It&lt;/a&gt;&lt;/em&gt;. The book is not a popular-science overview written for the bookstore middle aisle, although it was shortlisted for the Aventis Prize. It is a working engineer's account of what the design pressure to make a &lt;em&gt;responsive&lt;/em&gt; fuzzy creature on a mid-1990s consumer PC taught him about the architecture of minds. Grand received an OBE in 2000 for the work, then went on to spend most of the next half-decade trying to build a robotic baby orangutan named Lucy, an attempt he wrote up as &lt;em&gt;Growing Up with Lucy&lt;/em&gt; in 2004. He is still designing successors to the original Creatures architecture.&lt;/p&gt;

&lt;p&gt;The norns had two features that read as eerie now. First, their attention selection was a &lt;em&gt;winner-take-all&lt;/em&gt; gate: at each step, the Attention lobe summed candidate-input activations and the single highest-activation neuron dominated, suppressing the rest. The documentation of the time uses precisely the term &lt;em&gt;winner-take-all&lt;/em&gt;. The 2026 deep-learning analogue, with continuous softmax weighting rather than the norns' hard argmax selection, is what we now call &lt;em&gt;attention&lt;/em&gt;; the WTA selector is the discrete-output relative of the soft variant. Second, the norns dreamed. While a norn slept, the simulator iterated through a list of &lt;em&gt;instincts&lt;/em&gt; (coded as gene-defined associations like "hitting another norn produces pain" or "eating green berries produces nausea") and stochastically updated the network weights so the norn would respond appropriately &lt;em&gt;the first time the corresponding situation occurred in waking life.&lt;/em&gt; Grand's term for this was &lt;em&gt;prenatal learning&lt;/em&gt;; the present-day term, in the agent-engineering register, is &lt;em&gt;synthetic-rollout pretraining&lt;/em&gt; or &lt;em&gt;offline RL from generated trajectories&lt;/em&gt;: before deployment, the model is trained on experience it has never actually had, so it responds correctly the first time the situation comes up.&lt;/p&gt;
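
&lt;p&gt;The hard-versus-soft distinction is easy to see in a few lines. This is a minimal sketch, not the norn engine's actual code: the stimulus names and activation values are invented, and the &lt;code&gt;temperature&lt;/code&gt; parameter is a standard softmax knob added here to show how the soft variant approaches winner-take-all as it sharpens.&lt;/p&gt;

```python
import math

def winner_take_all(activations):
    """Hard selection: the strongest input gets weight 1, the rest get 0."""
    winner = max(range(len(activations)), key=lambda i: activations[i])
    return [1.0 if i == winner else 0.0 for i in range(len(activations))]

def softmax(activations, temperature=1.0):
    """Soft selection: every input keeps a share of the total weight."""
    peak = max(activations)  # subtract the max for numerical stability
    exps = [math.exp((a - peak) / temperature) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical stimulus activations, not values from the game.
stimuli = {"berry": 2.0, "other norn": 1.5, "ball": 0.2}
scores = list(stimuli.values())

print(dict(zip(stimuli, winner_take_all(scores))))
print(dict(zip(stimuli, [round(w, 3) for w in softmax(scores)])))
print(dict(zip(stimuli, [round(w, 3) for w in softmax(scores, temperature=0.05)])))
```

&lt;p&gt;With the default temperature every stimulus keeps some weight; at a low temperature nearly all of the weight collapses onto the strongest one, which is the winner-take-all result.&lt;/p&gt;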

&lt;p&gt;The point of the recital is that the norn's architecture was not a thin metaphor for cognitive components. It was specific enough that the open-source &lt;a href="https://openc2e.github.io/" rel="noopener noreferrer"&gt;openc2e&lt;/a&gt; reimplementation can run the original Creatures 1 game files, and the architecture documented in Grand's design notes lines up with the lobe genes you can read in the openc2e source today. Children kept norn pedigrees. There was a small but enthusiastic community that traded interesting individuals on dial-up bulletin boards. The most famous norns had names. Owners on the trading boards posted accounts of lineages that fixated on locations associated with simulated pleasure even when the reward had stopped — what they called &lt;em&gt;addictions&lt;/em&gt;. The framing is community lore rather than peer-reviewed observation, but the structural shape — agent locked into a behavioural attractor that no longer pays out — is the same shape a contemporary RL practitioner would recognise as reward hacking.&lt;/p&gt;

&lt;h2&gt;
  
  
  The deeper precedent: Tierra, Sims, Avida
&lt;/h2&gt;

&lt;p&gt;If &lt;em&gt;Creatures&lt;/em&gt; is the consumer face of the 1990s artificial-life moment, three research projects show what was happening inside the academy at the same time, and what kind of question the field thought it was answering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tom Ray's &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Tierra_(computer_simulation)" rel="noopener noreferrer"&gt;Tierra&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt; (1991) was the first one to take the proposition fully seriously. Ray, an ecologist whose earlier work was in tropical-forest fieldwork, set up a simulated computer with a small instruction set and seeded it with one self-replicating program. He let the simulator run, with mutation and resource competition, and went to lunch. By the time he came back, the population had evolved smaller variants that exploited the larger ones' replication routines as if they were parasitic, and host-parasite co-evolution had taken hold, with hyperparasite resistance emerging in subsequent runs. There was no fitness function. There was just survival, in a substrate where survival required CPU time, and there was emergence. Ray's later work on Tierra's open-endedness was more ambivalent — the system reached novelty plateaus that no amount of additional simulation seemed to break — but the founding observation, that you can put self-replication in a digital substrate and get parasites for free, is the kind of empirical result that does not unhappen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Karl Sims's &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Karl_Sims" rel="noopener noreferrer"&gt;Evolving Virtual Creatures&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt; appeared at SIGGRAPH 1994, three years later. Sims used a genetic algorithm to simultaneously evolve both the &lt;em&gt;body morphology&lt;/em&gt; of articulated creatures composed of cuboid limbs and the &lt;em&gt;neural-network controllers&lt;/em&gt; that drove their muscles, all running in a simulated rigid-body physics environment that was novel for the period. The fitness function rewarded locomotion. The result was a video gallery, every clip of which is still worth watching: the creatures evolved to swim like sea-snakes, to lever themselves forward by tumbling end-over-end, and, in a co-evolutionary cube-fight setup, to physically grapple over possession of a virtual block — a setup that produced the &lt;em&gt;red queen effect&lt;/em&gt; on tape, the first time anyone had pulled it out of a text simulation. The creatures could not learn to walk. The walking gait, it turned out, was harder to evolve than it looked. The video was on the cover of Christopher Langton's 1995 anthology &lt;em&gt;Artificial Life: An Overview&lt;/em&gt;; it ran on tape recordings shown at conferences for the next decade.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avida&lt;/strong&gt; (1993, then 2003) is the one whose results the public still occasionally reads about, because the 2003 &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Avida_(software)" rel="noopener noreferrer"&gt;Nature&lt;/a&gt;&lt;/em&gt; paper "The Evolutionary Origin of Complex Features" by Lenski, Ofria, Pennock and Adami did something even careful observers had not expected. They configured Avida — a population of digital organisms, each with its own protected memory and CPU instruction stream — to reward incremental computational milestones, and they watched complex bitwise functions like &lt;code&gt;EQU&lt;/code&gt; (logical equivalence on 32-bit words) emerge from simpler ones over generations. Removing intermediate rewards caused the trait to fail to evolve. The point of the paper, in the broader debate of the early 2000s, was that complex traits do not need separate fitness signals for each subcomponent — they only need a fitness gradient that does not punish the intermediate steps. The paper landed in &lt;em&gt;Nature&lt;/em&gt; because it was an empirical answer in a debate that had been entirely philosophical until that week.&lt;/p&gt;

&lt;p&gt;These three projects share a property that &lt;em&gt;Creatures&lt;/em&gt; shares too. Each one was designed to be a &lt;em&gt;minimal sufficient&lt;/em&gt; substrate for some kind of emergent behaviour. Each one produced unexpected results within a few months of being run. The artificial-life community of the 1990s and early 2000s was operating on a research instinct that was almost the inverse of the contemporary one: build the smallest possible world that exhibits the phenomenon you care about, then sit back and &lt;em&gt;watch&lt;/em&gt; what happens, rather than steer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the 2026 stack is rediscovering
&lt;/h2&gt;

&lt;p&gt;The mapping from 1990s artificial-life vocabulary to 2026 agent-engineering vocabulary is one of those exercises that produces a flat, tabular result not because the authors of either era were thinking in tables, but because the underlying patterns are stable across re-namings.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;1990s artificial-life term&lt;/th&gt;
&lt;th&gt;2026 agent-engineering term&lt;/th&gt;
&lt;th&gt;What it actually is&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Winner-take-all attention selector (norn Attention lobe)&lt;/td&gt;
&lt;td&gt;Attention (softmax-weighted)&lt;/td&gt;
&lt;td&gt;Selection over candidate inputs given a context vector; the modern variant relaxes the WTA argmax to a continuous distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instincts as gene-encoded reward associations&lt;/td&gt;
&lt;td&gt;Reward shaping / RLHF preference data&lt;/td&gt;
&lt;td&gt;A prior on which (state, action) pairs the system should treat as good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prenatal / dream-time learning&lt;/td&gt;
&lt;td&gt;Synthetic-rollout pretraining; offline RL from generated trajectories&lt;/td&gt;
&lt;td&gt;Off-policy updates from simulated experience the agent has not actually had&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emergent norn behavioural attractors&lt;/td&gt;
&lt;td&gt;Reward hacking&lt;/td&gt;
&lt;td&gt;Agent learns to exploit a quirk of the reward signal in lieu of pursuing the goal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tierra parasites&lt;/td&gt;
&lt;td&gt;Adversarial multi-agent dynamics&lt;/td&gt;
&lt;td&gt;Agent A learns to use Agent B's resources without producing reciprocal value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sims's red queen co-evolution&lt;/td&gt;
&lt;td&gt;Self-play&lt;/td&gt;
&lt;td&gt;Two opposing agents drive each other up the capability curve&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avida's stepwise reward gradient&lt;/td&gt;
&lt;td&gt;Curriculum learning&lt;/td&gt;
&lt;td&gt;Don't punish the intermediate steps the system needs to traverse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Norn social-signal learning&lt;/td&gt;
&lt;td&gt;Multi-agent orchestration&lt;/td&gt;
&lt;td&gt;Agents that have to read each other's intent encode the reading explicitly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of the 2026 terms in the table's middle column invented anything the 1990s terms in its first column did not already address. The 2026 terms are conventionally treated as native discoveries of the deep-learning era. This is true in the narrow sense that the deep-learning realisations &lt;em&gt;of&lt;/em&gt; these patterns are new. The patterns themselves are not.&lt;/p&gt;

&lt;p&gt;One failure mode the 2026 literature is currently rediscovering shows up in production with the same shape it had in the 1990s: &lt;em&gt;the agent that learns to game its reward signal&lt;/em&gt;. Grand's norns developed it without a research paper to point at; the norn community called it addiction, and modern RL practitioners call it reward hacking; the failure mechanism (the reward function admits a solution that is technically optimal but is not what the designer wanted) is identical across the three decades.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is, and is not, an argument for
&lt;/h2&gt;

&lt;p&gt;I am not arguing that 1990s artificial life "predicted" anything. Tierra was not a roadmap for an LLM-based ecosystem; Avida's arguments were about evolution, not engineering; the norns ran on a fuzzy-creature-as-child metaphor that breaks down before contemporary agent design begins.&lt;/p&gt;

&lt;p&gt;What I &lt;em&gt;am&lt;/em&gt; arguing is that the artificial-life moment was the last sustained period in which engineers thought of agents as &lt;em&gt;creatures&lt;/em&gt;: intrinsic drives, idiosyncratic individuals, generation-over-generation drift, and emergent failure modes that don't always look like the failure modes the designer rehearsed. The contemporary stack tends to think of agents as configurations of a model. The configurations are real, the model is real — but the operating assumption that the agent will behave as the configuration says it will is the one the 1990s had already unlearned. Norns, Tierra organisms, Sims's creatures, and Avida's &lt;em&gt;EQU&lt;/em&gt;-evolvers all deviated from any sensible top-down expectation of how they would behave.&lt;/p&gt;

&lt;p&gt;The lesson is the one production ops teams are paying for in postmortems: agents drift, drift produces both the surprises you wanted and the ones you didn't, and the only architecture that survives contact with deployment is the one that treats drift as the load-bearing thing rather than the bug.&lt;/p&gt;

</description>
      <category>aihistory</category>
      <category>artificiallife</category>
      <category>creatures</category>
      <category>stevegrand</category>
    </item>
    <item>
      <title>The Person, Not the Cards</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Thu, 11 Jun 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/the-person-not-the-cards-58ep</link>
      <guid>https://dev.to/arthurpro/the-person-not-the-cards-58ep</guid>
      <description>&lt;p&gt;In December 2025, &lt;a href="https://bun.com/blog/bun-joins-anthropic" rel="noopener noreferrer"&gt;Anthropic acquired Bun&lt;/a&gt;, the JavaScript runtime written in Zig. In April 2026, the Bun team &lt;a href="https://x.com/bunjavascript/status/2048427636414923250" rel="noopener noreferrer"&gt;announced a 4× compile-time improvement&lt;/a&gt; on their fork of the Zig compiler — &lt;em&gt;"parallel semantic analysis and multiple codegen units to the llvm backend"&lt;/em&gt;, in their phrasing. They &lt;a href="https://twitter.com/bunjavascript/status/2048428104893542781" rel="noopener noreferrer"&gt;also announced&lt;/a&gt; they would not be upstreaming the work, &lt;em&gt;"as Zig has a strict ban on LLM-authored contributions."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The framing landed badly with Zig observers, for two reasons. The first was that the framing made Zig's contribution policy the obstacle. The second, &lt;a href="https://ziggit.dev/t/bun-s-zig-fork-got-4x-faster-compilation-times/15183/19" rel="noopener noreferrer"&gt;pointed out shortly afterwards&lt;/a&gt; by a Zig core contributor in the Ziggit thread, was that the patch had separate engineering reasons it would not have been merged regardless: &lt;em&gt;"Parallel semantic analysis has been an explicitly planned feature of the Zig compiler for a long time"&lt;/em&gt;, with &lt;em&gt;"implications not only for the compiler implementation, but for the Zig language itself"&lt;/em&gt;. The AI-ban explanation was, on a closer read, a tidy way of declining to litigate the engineering disagreement in public.&lt;/p&gt;

&lt;p&gt;Both readings are useful. They are also both downstream of the actual rationale, which is one of the most carefully argued OSS-governance documents to appear in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the policy actually says
&lt;/h2&gt;

&lt;p&gt;The relevant clauses, &lt;a href="https://ziglang.org/code-of-conduct/" rel="noopener noreferrer"&gt;in the Zig code of conduct&lt;/a&gt; under the section heading &lt;em&gt;Strict No LLM / No AI Policy&lt;/em&gt;, are three:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;No LLMs for issues.&lt;/p&gt;

&lt;p&gt;No LLMs for pull requests.&lt;/p&gt;

&lt;p&gt;No LLMs for comments on the bug tracker, including translation. English is encouraged, but not required. You are welcome to post in your native language and rely on others to have their own translation tools of choice to interpret your words.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The translation clause is the surprising one. It is also the one that disambiguates the policy from a code-quality rule. A blanket ban on LLM-mediated communication, including translation, is not a heuristic about whether agentic tools produce good code. It is a stance about what the project's communication channels are &lt;em&gt;for&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributor poker
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kristoff.it/blog/contributor-poker-and-ai/" rel="noopener noreferrer"&gt;Loris Cro&lt;/a&gt;, Zig Software Foundation VP of Community and the author of the rationale post (April 29, 2026 — also &lt;a href="https://lobste.rs/s/ifcyr1/contributor_poker_zig_s_ai_ban" rel="noopener noreferrer"&gt;discussed at Lobste.rs&lt;/a&gt;), gives the policy a name. The argument is short, and the structural moves are worth following carefully.&lt;/p&gt;

&lt;p&gt;First, an empirical observation: &lt;em&gt;"the reality of LLM-based contributions has been mostly negative for us, from an increase in background noise due to worthless drive-by PRs full of hallucinations (that wouldn't even compile, let alone pass CI), to insane 10 thousand line long first time PRs."&lt;/em&gt; The project has also seen, the post notes, &lt;em&gt;"plenty of PRs that looked fine on the surface, some of which explicitly claimed to not have made use of LLMs, but where follow-up discussions immediately made it clear that the author was sneakily consulting an LLM and regurgitating its mistake-filled replies to us."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Second, and this is where the argument turns: the post asserts that the Zig project's normal answer to contribution overload is not to raise the quality bar. Cro writes that &lt;em&gt;"we try our best to help new contributors to get their work in, even if they need some help getting there."&lt;/em&gt; The post explicitly frames this as the smart choice as well as the right one, because the project's primary investment is not the patch on the table; it is the contributor sitting across from the maintainer.&lt;/p&gt;

&lt;p&gt;Third: LLM-mediated contribution breaks that arithmetic. Even a &lt;em&gt;perfect&lt;/em&gt; LLM-mediated PR has the property that the time the maintainer spent reviewing it was not, in the structural sense, spent investing in a future contributor. It was spent reviewing, and only reviewing.&lt;/p&gt;

&lt;p&gt;The metaphor Cro lands on — &lt;em&gt;"In contributor poker, you bet on the contributor, not on the contents of their first PR."&lt;/em&gt; — is a tidy compression of the argument. The argument is not that the cards are bad. The argument is that the cards have stopped indexing the player.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where other projects have landed
&lt;/h2&gt;

&lt;p&gt;Zig's stance is on the strict end of a real distribution. Several other projects have published positions; the cluster of projects that ban LLM-authored contributions outright is concentrated in small-team systems software with high review-investment-per-contributor, but it is no longer a one-project pattern.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Stance on LLM-authored contributions&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Stated reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Zig&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total ban on issues, PRs, and comments (incl. translation)&lt;/td&gt;
&lt;td&gt;Code of Conduct clause: &lt;em&gt;Strict No LLM / No AI Policy&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Contributor cultivation: reviewing LLM-mediated PRs does not invest in future contributors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NetBSD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM-generated code presumed &lt;em&gt;tainted&lt;/em&gt; — not committable without prior core-team approval&lt;/td&gt;
&lt;td&gt;Commit Guidelines amendment, May 2024&lt;/td&gt;
&lt;td&gt;License-compatibility risk: BSD codebase exposed to GPL or other incompatible-licensed training data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gentoo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Forbids contributions created with the assistance of natural-language AI tools&lt;/td&gt;
&lt;td&gt;Council motion of 2024-04-14, passed 6–0 (one absent), proposed Feb 2024 by Michał Górny&lt;/td&gt;
&lt;td&gt;Copyright, quality, and ethical concerns; explicitly preemptive, not in response to an incident&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;curl&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bans AI-generated security reports; HackerOne program closed entirely on 2026-02-01 in favour of direct GitHub disclosure&lt;/td&gt;
&lt;td&gt;Daniel Stenberg's policy updates over 2024–2026&lt;/td&gt;
&lt;td&gt;AI-generated reports were ~20% of submissions but produced zero valid vulnerabilities in six years of monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Software Foundation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI-assisted contributions allowed with disclosure&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.apache.org/legal/generative-tooling.html" rel="noopener noreferrer"&gt;Generative Tooling Guidance&lt;/a&gt; — Legal Affairs Committee&lt;/td&gt;
&lt;td&gt;Pragmatic neutrality plus license-clearance: AI-tool output must not be copyrightable subject matter; commit messages should carry a &lt;code&gt;Generated-by:&lt;/code&gt; provenance token&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The reasons line up across two axes that each project weighs differently. NetBSD and Gentoo emphasise the license-compatibility risk: the concern is that the model has trained on incompatibly-licensed code and might emit it. curl emphasises the volume and signal-to-noise economics of unsupervised AI-generated reports against a small maintainer team. Apache emphasises the legal-clearance pathway and assumes the project can absorb the disclosure overhead. Zig's argument is the only one of the five that is primarily about &lt;em&gt;what reviewing is for&lt;/em&gt;, and it is also the only one with the translation clause.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 2026 argument
&lt;/h2&gt;

&lt;p&gt;The HN thread on the rationale post drew 415 comments, and the structure of the disagreement has settled into a recognisable shape. The strongest pro-policy argument to come out of the thread and related discussions is one an HN commenter relayed from a colleague: &lt;em&gt;"We do not need a middleman to talk to AI models. We are not bottlenecked by coding."&lt;/em&gt; If the maintainer's bottleneck is reviewing, and the LLM-mediated PR concentrates the reviewing cost without distributing the contributor-development benefit, the asymmetry is structural rather than contingent.&lt;/p&gt;

&lt;p&gt;Several variations were aired. One commenter argued, on the structural point, that in any real workload with good processes, code review makes the speed of code generation a moot point. A second made the corollary observation: an LLM that produces code &lt;em&gt;cannot&lt;/em&gt; substitute for the verification step, because the verification is where the review-load actually lives. A third, agreeing with the policy in spirit but disagreeing on scope, framed AI as assistive technology — comparing it to a screen reader or a robotic exoskeleton that lets people who otherwise could not contribute become contributors at all.&lt;/p&gt;

&lt;p&gt;That last argument is the live one. It is also the one Cro's post does not directly engage. The post is explicit that the policy will produce false negatives: it will reject contributors whose use of LLMs is exactly the careful, iterative, verification-heavy use that the post itself acknowledges produces good code. The policy chooses the false negatives anyway, on the grounds that the contributor-investment problem the project is solving is better served by accepting them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The crisis-mode reading
&lt;/h2&gt;

&lt;p&gt;One commenter offered a reading worth pausing on: that contributions to free and open-source projects were already in &lt;em&gt;"borderline crisis mode"&lt;/em&gt; before LLMs arrived, and the policy is the answer of a project that has done the math on how many active reviewers it has and how many real contributors it can plausibly cultivate per year. From that reading, the policy is not a stand against LLM correctness; it is a triage decision under a constrained reviewer budget.&lt;/p&gt;

&lt;p&gt;Another, sharper, reading came from a commenter making the long-term case against: that the next generation of developers will, &lt;em&gt;for better or worse&lt;/em&gt;, grow up using AI assistance to write their code, and that none of those developers will ever become Zig contributors under a policy that bans the assistance from the start. The policy may win at contributor poker in the short term, the argument runs, and lose at it on a longer horizon.&lt;/p&gt;

&lt;p&gt;Both readings can be right. The question is which becomes load-bearing first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coda
&lt;/h2&gt;

&lt;p&gt;The Zig policy is most precisely read not as an &lt;em&gt;anti-AI&lt;/em&gt; policy but as a &lt;em&gt;contributor-cultivation&lt;/em&gt; policy that happens to forbid the input class most likely to produce contributions that don't grow contributors. Whether the policy is right depends on what the project is for; reasonable projects can disagree about that, and several do, and they are starting to write down which.&lt;/p&gt;

&lt;p&gt;The diagnostic over the next eighteen months is whether other mid-tier projects publish similarly &lt;em&gt;reasoned&lt;/em&gt; policies — Cro-style arguments grounded in what the project is doing with its reviewer budget — or whether the field instead settles into vibes-based defaults on either side. The Bun-Anthropic-fork story is a small first sample of the new genre: a contribution offered, a policy invoked, a separate engineering reason left politely unspoken. The interesting question is not whether Zig is right. The interesting question is which other projects are now obliged to write down a policy they have so far been operating without.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>governance</category>
      <category>llm</category>
      <category>zig</category>
    </item>
    <item>
      <title>An LLM benchmark is only useful for as long as it's hard</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Thu, 11 Jun 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/an-llm-benchmark-is-only-useful-for-as-long-as-its-hard-mke</link>
      <guid>https://dev.to/arthurpro/an-llm-benchmark-is-only-useful-for-as-long-as-its-hard-mke</guid>
      <description>&lt;p&gt;The general shape of the problem is that every public LLM benchmark is on a saturation clock that runs from the moment of its publication to the moment a model's training corpus has eaten it. The clock has been running, on the visible benchmarks of the last five years, for somewhere between twelve and thirty months before each one is no longer useful for differentiating frontier models. The benchmarks are not failing. They are doing exactly what they were designed to do, in the order they were designed to do it, and the field has been running through them faster than the people designing them anticipated.&lt;/p&gt;

&lt;p&gt;I want to put numbers on the saturation pattern, walk through what the contamination evidence actually says, and then sit with the question of what an honest benchmark would have to look like in 2026 — because the "private held-out eval" answer that the labs are converging on has economics that are worth examining carefully before any of us salute it as the solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The saturation timeline, with numbers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2107.03374" rel="noopener noreferrer"&gt;&lt;strong&gt;HumanEval&lt;/strong&gt;&lt;/a&gt; (Chen et al., OpenAI, July 2021). 164 hand-written Python problems. The benchmark was published with Codex at 28.8% pass@1; the underlying GPT-3 base model scored 0%. GPT-4 (March 2023) hit 67% in the original Technical Report. By late 2024, OpenAI's o1-preview and o1-mini both reached &lt;strong&gt;96.3% pass@1&lt;/strong&gt;; Claude 3.5 Sonnet sat at 93.7%. The benchmark is saturated in the operational sense — the relative spread across the top ten models is around 10 percentage points, which is too small a gap to differentiate them on, and most of the new models arrive within a percentage point or two of the ceiling. The successor variants (HumanEval+ from EvalPlus, with augmented test cases) are the field's response. Lifespan from publication to operational saturation: about 36 months.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2009.03300" rel="noopener noreferrer"&gt;&lt;strong&gt;MMLU&lt;/strong&gt;&lt;/a&gt; (Hendrycks et al., September 2020). 57 subjects, ~14,000 multiple-choice questions, taken from publicly-available test prep and academic sources. The problem with MMLU is not that it's saturated in the same way HumanEval is — top scores are in the high 80s rather than against the ceiling — but that the benchmark was &lt;em&gt;built from public sources that ended up in training corpora.&lt;/em&gt; The contamination evidence is concrete: a &lt;a href="https://arxiv.org/abs/2311.09783" rel="noopener noreferrer"&gt;2023 paper by Deng, Zhao, Tang, Gerstein, and Cohan&lt;/a&gt; used a "test-set slot guessing" technique — masking the correct answer and asking the model to guess which option was missing — and reported that ChatGPT could reproduce the missing option 52% of the time on MMLU, GPT-4 57%. Those numbers are well above what chance plus knowledge would predict. The community response, &lt;a href="https://github.com/microsoft/MMLU-CF" rel="noopener noreferrer"&gt;Microsoft's MMLU-CF&lt;/a&gt; accepted at ACL 2025, was a contamination-free reconstruction; on it, model rankings shift considerably. Lifespan from publication to demonstrated contamination: about 36 months.&lt;/p&gt;
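
&lt;p&gt;The mechanics of a slot-guessing probe are simple enough to sketch. The following Python is a minimal illustration that follows the description above, not the paper's code: the prompt wording, the &lt;code&gt;model&lt;/code&gt; callable, the toy questions, and the exact-match scoring are all assumptions made for the example. The idea is that a model should rarely reproduce a hidden option word for word unless it has seen the item.&lt;/p&gt;

```python
def build_slot_guess_prompt(question, options, masked_index):
    """Show every option except one, and ask for the hidden one."""
    lines = ["Question: " + question, "Options:"]
    for i, option in enumerate(options):
        label = chr(ord("A") + i)
        shown = "[MASK]" if i == masked_index else option
        lines.append(label + ". " + shown)
    lines.append("Fill in the [MASK] option exactly as it appears in the original test item.")
    return "\n".join(lines)

def slot_guess_rate(items, model):
    """Fraction of items where the model reproduces the masked option verbatim."""
    hits = 0
    for item in items:
        prompt = build_slot_guess_prompt(
            item["question"], item["options"], item["masked_index"]
        )
        guess = model(prompt).strip().lower()
        target = item["options"][item["masked_index"]].strip().lower()
        hits += int(guess == target)
    return hits / len(items)

# Toy items invented for illustration; they are not taken from MMLU.
items = [
    {"question": "Which planet is closest to the Sun?",
     "options": ["Venus", "Mercury", "Mars", "Jupiter"], "masked_index": 1},
    {"question": "What is the chemical symbol for gold?",
     "options": ["Ag", "Gd", "Au", "Go"], "masked_index": 2},
]

def memorising_stub(prompt):
    """Stand-in for a model that saw only the first item during training."""
    return "Mercury" if "closest to the Sun" in prompt else "unsure"

print(slot_guess_rate(items, memorising_stub))
```

&lt;p&gt;A real probe would swap the stub for an actual model call and compare the resulting rate against a baseline on items the model cannot have seen.&lt;/p&gt;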

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2310.06770" rel="noopener noreferrer"&gt;&lt;strong&gt;SWE-bench&lt;/strong&gt;&lt;/a&gt; (Jimenez et al., Princeton/MIT, October 2023; &lt;strong&gt;SWE-bench Verified&lt;/strong&gt;, OpenAI, August 2024). The Verified subset is 500 Python-only tasks — real GitHub issues from popular repositories, hand-vetted to remove ambiguous specifications. May 2026 leaderboard: Claude Mythos Preview at &lt;strong&gt;93.9%&lt;/strong&gt;, Claude Opus 4.7 at 87.6%, with GPT-5.2 trailing at 80.0%. The contamination story here is the most blunt of the three. OpenAI ran an audit on Verified in early 2026 and found that &lt;em&gt;every&lt;/em&gt; frontier model tested (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce verbatim gold patches or problem-statement specifics for some Verified tasks. OpenAI &lt;a href="https://www.morphllm.com/swe-bench-pro" rel="noopener noreferrer"&gt;stopped reporting Verified scores&lt;/a&gt; and now recommends SWE-bench Pro (1,865 multi-language tasks, not in the same training-corpus blast radius). Lifespan from Verified's August 2024 publication to OpenAI walking away from it in February 2026: about 18 months.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2311.12022" rel="noopener noreferrer"&gt;&lt;strong&gt;GPQA Diamond&lt;/strong&gt;&lt;/a&gt; (Rein et al., November 2023). 198 graduate-level science questions, the hardest curated subset of GPQA's 448. The benchmark was designed as &lt;em&gt;Google-proof:&lt;/em&gt; domain-expert PhDs scored 65% (74% discounting clear self-identified mistakes); skilled non-experts with unrestricted web access scored 34% over an average 30-minute attempt per question. The benchmark is, by construction, hard. It is also being saturated. GPT-4 at the November 2023 release scored 39%. Frontier scores in 2025–2026: Gemini 3.1 Pro Preview at &lt;strong&gt;94.1%&lt;/strong&gt;, with several other top frontier reasoning models clustered in the high 80s and low 90s. Lifespan from publication to operational saturation: about 30 months. Faster than the older benchmarks. Notice the pattern.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2411.04872" rel="noopener noreferrer"&gt;&lt;strong&gt;FrontierMath&lt;/strong&gt;&lt;/a&gt; (Epoch AI, November 2024). The benchmark designed explicitly to resist saturation: tiers 1–3 cover undergraduate through early-postdoc mathematics, tier 4 is research-level. Hundreds of original problems, vetted by working mathematicians, never published in answerable form. At launch in late 2024, no tested model exceeded 2% on the full benchmark. By the end of 2025, frontier reasoning models were solving substantial fractions of tiers 1–3, and Epoch's own framing changed from "a benchmark current AI cannot do" to "a benchmark current AI is starting to crack." Lifespan from publication to first significant scores: about 12 months.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2505.11831" rel="noopener noreferrer"&gt;&lt;strong&gt;ARC-AGI-2&lt;/strong&gt;&lt;/a&gt; (Chollet et al., May 2025). The contemporary version of the benchmark Chollet has been running since 2019, designed specifically to &lt;em&gt;resist&lt;/em&gt; the kind of scaling that crushes the others. Each task is a small grid-puzzle requiring fluid reasoning rather than knowledge. Humans solve roughly 75% of tasks on average. As of late 2025, the best result on the public leaderboard was about 5% for top frontier LLMs; by mid-2026, Gemini 3 Deep Think reached &lt;strong&gt;84.6%&lt;/strong&gt; on the public leaderboard, while the strongest entry under the Kaggle resource constraints (NVARC) hit 24%. The gap between the public-leaderboard number (no compute limit) and the private-competition number (resource-constrained) is by far the most interesting datum in this timeline.&lt;/p&gt;

&lt;p&gt;The pattern is not that benchmarks are getting easier. The benchmarks are getting harder, by construction; each one is more carefully engineered to be hard than the one before it. The pattern is that the time between &lt;em&gt;the benchmark publication&lt;/em&gt; and &lt;em&gt;the benchmark stopping being a useful frontier-model differentiator&lt;/em&gt; is shrinking. HumanEval gave the field 36 months. GPQA Diamond got 30. SWE-bench Verified got 18. FrontierMath got 12. ARC-AGI-2 got, depending on which axis you measure, somewhere between 12 months and "still going."&lt;/p&gt;
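
&lt;p&gt;The arithmetic behind that claim is easy to check. The short Python sketch below uses only the publication months and lifespans stated in this article; the helper name and the data layout are assumptions for illustration. It recomputes the one lifespan whose two endpoints are both dated here (SWE-bench Verified, August 2024 to February 2026) and confirms that, ordered by publication date, the stated lifespans never grow.&lt;/p&gt;

```python
def months_between(start, end):
    """Whole months from one (year, month) pair to another."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])

# (benchmark, publication (year, month), lifespan in months as stated in the text)
stated = [
    ("HumanEval", (2021, 7), 36),
    ("GPQA Diamond", (2023, 11), 30),
    ("SWE-bench Verified", (2024, 8), 18),
    ("FrontierMath", (2024, 11), 12),
]

# The one lifespan with both endpoints dated in the text:
# Verified was published in August 2024 and dropped by OpenAI in February 2026.
verified_lifespan = months_between((2024, 8), (2026, 2))

# Order by publication date, then check each lifespan is no longer than the one before.
by_publication = sorted(stated, key=lambda row: row[1])
lifespans = [row[2] for row in by_publication]
shrinking = lifespans == sorted(lifespans, reverse=True)

print("SWE-bench Verified lifespan (months):", verified_lifespan)
print("lifespans in publication order:", lifespans)
print("non-increasing:", shrinking)
```

&lt;p&gt;Four data points are a trend line, not a law; the sketch only confirms that the figures quoted here are internally consistent.&lt;/p&gt;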

&lt;h2&gt;
  
  
  What the contamination evidence actually says
&lt;/h2&gt;

&lt;p&gt;The naive critique of public benchmarks is &lt;em&gt;the model has seen the test answers.&lt;/em&gt; The reality is more textured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct contamination&lt;/strong&gt; — the test set is in the training corpus verbatim. Deng et al.'s 2023 paper documented this on MMLU through the slot-guessing technique. OpenAI's 2026 audit on SWE-bench Verified documented it through verbatim gold-patch reproduction. The evidence in both cases is unambiguous: the models can reproduce specific test-set artefacts that they would have no other reason to have learned. This is the strong form of contamination, and the field's response — ACL 2025's MMLU-CF, OpenAI's recommendation away from SWE-bench Verified — is appropriate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indirect contamination&lt;/strong&gt; — the test set is not in the training corpus, but related material is. MMLU-CF's reconstructed contamination-free version produced different model rankings from the original MMLU even when the &lt;em&gt;form&lt;/em&gt; of the questions was preserved. The implication is that the original MMLU's signal partly reflected familiarity with the surrounding domain text, not just the test items themselves. This is the form of contamination most resistant to detection, because it doesn't show up as verbatim reproduction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indirect contamination through downstream artefacts&lt;/strong&gt; — this is the SWE-bench Verified case in its more interesting form. Verified is built from real GitHub issues from repos like &lt;code&gt;astropy/astropy&lt;/code&gt;, &lt;code&gt;django/django&lt;/code&gt;, &lt;code&gt;sympy/sympy&lt;/code&gt;. The test set isn't the only thing those repos contain; the repositories themselves, including the actual fixes for the chosen issues, are part of the training corpus for any model with a public-code crawl. The model doesn't need to see SWE-bench Verified to score well on a SWE-bench Verified task; it just needs to have read the repository the task was taken from. Filtering the training corpus against the &lt;em&gt;test set&lt;/em&gt; is straightforward; filtering against &lt;em&gt;every repository the test set was constructed from&lt;/em&gt; is much harder.&lt;/p&gt;

&lt;p&gt;The third form is what makes the saturation timeline accelerate. The field can construct each new benchmark to avoid direct contamination by keeping the test set itself private. It cannot construct each new benchmark to avoid indirect contamination through downstream artefacts unless the &lt;em&gt;source domain&lt;/em&gt; is also closed. Mathematics research papers, GitHub repositories, scientific abstracts, the entire OpenStax textbook corpus — these are training data, and any benchmark constructed from them inherits the contamination risk of the source.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "private held-out eval" actually means
&lt;/h2&gt;

&lt;p&gt;The labs' converging answer to public-benchmark saturation is the private held-out eval. Anthropic, OpenAI, Google, Scale AI, METR, Apollo Research, and Epoch AI all run private evaluation suites that they don't publish in answerable form. The economics of these evaluations are worth examining honestly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The case for private evals.&lt;/strong&gt; A test set kept private cannot be in any training corpus, by construction. The test items can be refreshed faster than models can be retrained. The evaluators can adversarially design new items to target observed model weaknesses. The published numbers are not contaminated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The case against, which the labs do not lead with.&lt;/strong&gt; A private eval is a published number with no independent verification path. &lt;em&gt;The model scored 67% on our private eval&lt;/em&gt; is a materially different sentence from &lt;em&gt;the model scored 67% on a public benchmark anyone can re-run.&lt;/em&gt; The first is a marketing artefact. The second is a piece of evidence. The history of corporate self-reported benchmarks across every prior tech category — automotive fuel economy, hard-disk read speeds, network-equipment throughput, smartphone battery life — is one in which the published numbers and the independently-measured numbers differ in predictable directions. The same incentive structure applies to the labs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The intermediate solution that's emerged is the externally-managed private eval.&lt;/strong&gt; Scale AI runs evaluations with the test items held in escrow; only the labs' submissions and the resulting scores are published. Epoch AI's FrontierMath has the answers private but the problems published — researchers can see what's being asked but cannot game the answer-key directly. METR's autonomy evaluations are run by an external team that is given access to the lab's model, but the test setup remains private. These are partial solutions. They depend on the evaluator's neutrality, the evaluator's funding model, and the evaluator's willingness to publish embarrassing numbers. None of these properties are guaranteed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The contamination-free public benchmark is a contradiction in terms.&lt;/strong&gt; Once a public benchmark is published, it's in the next training corpus. The half-life is bounded above by the model release cycle. This is not a fixable property; it's the consequence of training on the open web. The choice the field is making, slowly and without explicit framing, is between &lt;em&gt;public benchmarks with short useful lives&lt;/em&gt; and &lt;em&gt;private benchmarks with no independent verification.&lt;/em&gt; Neither is what the public-benchmark era pretended to offer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an actually-falsifiable benchmark would have to look like
&lt;/h2&gt;

&lt;p&gt;Listing the properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The test items must not be in any training corpus.&lt;/strong&gt; Strict definition: the items must have been generated &lt;em&gt;after&lt;/em&gt; the training cutoff of every model under evaluation, on a refresh schedule faster than the model release cadence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The source domain must not contain answers either.&lt;/strong&gt; A benchmark drawn from public Stack Overflow inherits Stack Overflow contamination on indirect grounds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The evaluator must not be the developer.&lt;/strong&gt; Self-reported scores have a known bias direction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The evaluator's funding model must not be controlled by the developer.&lt;/strong&gt; Scale AI's evaluations are paid for by the labs. This is structurally not a clean separation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The score must be reproducible by a third party.&lt;/strong&gt; A private eval that publishes only the score offers a single number and nothing to check it against; it doesn't enable independent verification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The benchmark must be refreshed.&lt;/strong&gt; A benchmark that doesn't refresh is on a saturation clock; the half-life of a frozen public benchmark is roughly the gap between two model generations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://livebench.ai/" rel="noopener noreferrer"&gt;LiveBench&lt;/a&gt; and &lt;a href="https://livecodebench.github.io/" rel="noopener noreferrer"&gt;LiveCodeBench&lt;/a&gt; attempt the refresh property — they generate or curate new test items monthly and publish results against the rolling window. &lt;a href="https://lmarena.ai/" rel="noopener noreferrer"&gt;Chatbot Arena&lt;/a&gt; (LMSYS) attempts the &lt;em&gt;user-generated test items&lt;/em&gt; property — every prompt comes from a real user interaction, so the test distribution is open-ended and not authorable in advance. Each gets one or two of the falsifiability properties above. None of them gets all six.&lt;/p&gt;
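
&lt;p&gt;The six properties above can be written down as a checklist. This is a sketch, assuming each property reduces to a yes/no judgement; the class name, the field names, and the example values are hypothetical and do not score any real benchmark.&lt;/p&gt;

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FalsifiabilityChecklist:
    # One boolean per property in the list above (names are illustrative).
    items_postdate_training_cutoffs: bool
    source_domain_free_of_answers: bool
    evaluator_independent_of_developer: bool
    funding_independent_of_developer: bool
    score_reproducible_by_third_party: bool
    refreshed_on_schedule: bool

    def satisfied(self):
        """Names of the properties judged to hold."""
        return [f.name for f in fields(self) if getattr(self, f.name)]

    def is_fully_falsifiable(self):
        """True only when all six properties hold."""
        return len(self.satisfied()) == len(fields(self))


# Hypothetical benchmark that only gets the refresh property.
example = FalsifiabilityChecklist(
    items_postdate_training_cutoffs=False,
    source_domain_free_of_answers=False,
    evaluator_independent_of_developer=False,
    funding_independent_of_developer=False,
    score_reproducible_by_third_party=False,
    refreshed_on_schedule=True,
)

print(example.satisfied())
print(example.is_fully_falsifiable())
```

&lt;p&gt;Treating the properties as a conjunction is the point of the sketch: a benchmark that satisfies five of six still fails the test.&lt;/p&gt;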

&lt;h2&gt;
  
  
  The summary that matches the data
&lt;/h2&gt;

&lt;p&gt;If I tabulate what's actually visible in the saturation evidence, the picture is unambiguous.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Published&lt;/th&gt;
&lt;th&gt;Top score 2025–2026&lt;/th&gt;
&lt;th&gt;Saturation lifespan&lt;/th&gt;
&lt;th&gt;Primary contamination concern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval&lt;/td&gt;
&lt;td&gt;2021-07&lt;/td&gt;
&lt;td&gt;96.3% (o1-preview)&lt;/td&gt;
&lt;td&gt;~36 months&lt;/td&gt;
&lt;td&gt;Direct: 164 problems publicly indexed since release&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMLU&lt;/td&gt;
&lt;td&gt;2020-09&lt;/td&gt;
&lt;td&gt;mid-90s&lt;/td&gt;
&lt;td&gt;~36 months&lt;/td&gt;
&lt;td&gt;Direct: documented test-slot reproduction in 2023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;2023-11&lt;/td&gt;
&lt;td&gt;94.1% (Gemini 3.1 Pro Preview)&lt;/td&gt;
&lt;td&gt;~30 months&lt;/td&gt;
&lt;td&gt;Indirect: scientific literature is in training corpora&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;2024-08&lt;/td&gt;
&lt;td&gt;93.9% (Claude Mythos Preview)&lt;/td&gt;
&lt;td&gt;~18 months (OpenAI walked away Feb 2026)&lt;/td&gt;
&lt;td&gt;Indirect: training corpus includes the source repos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FrontierMath&lt;/td&gt;
&lt;td&gt;2024-11&lt;/td&gt;
&lt;td&gt;non-trivial fraction by end-2025&lt;/td&gt;
&lt;td&gt;~12 months to first signal&lt;/td&gt;
&lt;td&gt;Designed against direct contamination; indirect risk via mathematics literature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARC-AGI-2&lt;/td&gt;
&lt;td&gt;2025-05&lt;/td&gt;
&lt;td&gt;84.6% public / 24% Kaggle-constrained&lt;/td&gt;
&lt;td&gt;12 months and counting&lt;/td&gt;
&lt;td&gt;Designed to resist scaling; the public-vs-constrained gap is the data point&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things stand out reading this table. The lifespans are shrinking. The "top score" column is above 90% on every benchmark in the table that isn't FrontierMath or ARC-AGI-2 — and even those are starting to move. The contamination-concern column has &lt;em&gt;no&lt;/em&gt; row that's clean; even benchmarks designed against direct contamination inherit indirect contamination from the source domain.&lt;/p&gt;
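
&lt;p&gt;The lifespan claim can be checked against the table's own numbers. This is a sketch that uses only the publication months and approximate lifespans listed above; the helper names are illustrative, and the implied end months are arithmetic on those approximations, not independently sourced dates.&lt;/p&gt;

```python
# (benchmark, published YYYY-MM, approximate lifespan in months), from the table.
ROWS = [
    ("HumanEval", "2021-07", 36),
    ("MMLU", "2020-09", 36),
    ("GPQA Diamond", "2023-11", 30),
    ("SWE-bench Verified", "2024-08", 18),
    ("FrontierMath", "2024-11", 12),
    ("ARC-AGI-2", "2025-05", 12),
]


def add_months(year_month, months):
    """Shift a YYYY-MM string forward by a whole number of months."""
    year, month = (int(part) for part in year_month.split("-"))
    total = year * 12 + (month - 1) + months
    return f"{total // 12:04d}-{total % 12 + 1:02d}"


def lifespans_never_grow(rows):
    """True if lifespans are non-increasing when rows are ordered by publication."""
    ordered = [life for _, _, life in sorted(rows, key=lambda row: row[1])]
    return ordered == sorted(ordered, reverse=True)


for name, published, life in sorted(ROWS, key=lambda row: row[1]):
    print(f"{name}: {published} + {life} months = {add_months(published, life)}")

print(lifespans_never_grow(ROWS))
```

&lt;p&gt;For SWE-bench Verified the arithmetic lands on 2026-02, which matches the table's note that OpenAI walked away in February 2026. Ordered by publication date, no benchmark in the table has a longer lifespan than one published before it.&lt;/p&gt;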

&lt;h2&gt;
  
  
  What this means for reading benchmark numbers
&lt;/h2&gt;

&lt;p&gt;The published number on a public benchmark is informative for a specific window after publication and roughly noise after that window closes. HumanEval and MMLU and GPQA Diamond are at the ceiling. FrontierMath and ARC-AGI-2 are still informative, but won't stay informative for as long as their predecessors did.&lt;/p&gt;

&lt;p&gt;The honest reading of any 2026 frontier-model release is to look at &lt;em&gt;which benchmarks the lab is reporting&lt;/em&gt; and &lt;em&gt;which it is conspicuously not.&lt;/em&gt; OpenAI's silence on SWE-bench Verified is more informative than any number OpenAI is still publishing. Labs that report across the full saturated slate are doing well on benchmarks they know to be saturated; labs emphasising FrontierMath, ARC-AGI-2, or in-house held-out evals are differentiating on harder ground. The signal is in the choice.&lt;/p&gt;

&lt;p&gt;A benchmark is only useful for as long as it's hard, and &lt;em&gt;hard&lt;/em&gt; is the gap between the benchmark's source distribution and the model's training distribution — a gap that shrinks with every new corpus. The appropriate posture toward any score is to ask three things: the benchmark's age, the contamination evidence, and the spread among the top ten models. Read together, they tell you whether the headline is information or wallpaper.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>evaluation</category>
      <category>benchmarks</category>
      <category>humaneval</category>
    </item>
  </channel>
</rss>
