<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Tuszynski</title>
    <description>The latest articles on DEV Community by Michael Tuszynski (@michaeltuszynski).</description>
    <link>https://dev.to/michaeltuszynski</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1447774%2Fa99eea93-7845-4764-9fce-b1755bcfa456.png</url>
      <title>DEV Community: Michael Tuszynski</title>
      <link>https://dev.to/michaeltuszynski</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/michaeltuszynski"/>
    <language>en</language>
    <item>
      <title>Stop Paying by the Syllable</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Tue, 12 May 2026 15:18:32 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/stop-paying-by-the-syllable-1eg7</link>
      <guid>https://dev.to/michaeltuszynski/stop-paying-by-the-syllable-1eg7</guid>
      <description>&lt;p&gt;Open &lt;a href="https://www.anthropic.com/pricing" rel="noopener noreferrer"&gt;Anthropic's pricing page&lt;/a&gt; — or OpenAI's, or any other major frontier provider's — and the unit you are charged in is the token. A token is a sub-word linguistic chunk, about four characters in English, give or take. For practical purposes, you are paying by the syllable.&lt;/p&gt;

&lt;p&gt;This is the default unit of charge across the industry. It is also a strange unit. The thing you are actually trying to buy is a solved problem — a fixed migration, a generated invoice, a working summary, a correctly-typed SQL query. What gets metered is the verbal output the model produces along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real example
&lt;/h2&gt;

&lt;p&gt;Ask a frontier model to find the bug in a 600-line migration. With extended reasoning enabled, it generates roughly 8,000 tokens of internal deliberation and outputs 200 tokens of fix. The bill is 8,200 tokens. The work delivered is the fix. The other 8,000 tokens are the model thinking on the way to the answer.&lt;/p&gt;
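
&lt;p&gt;The arithmetic is worth seeing in one place. A minimal sketch with placeholder rates, not any provider's actual prices:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical rates for illustration only; check your provider's pricing page.
price_per_output_token = 15 / 1_000_000     # placeholder: $15 per million output tokens

thinking_tokens = 8_000                     # internal deliberation, billed like output
fix_tokens = 200                            # the actual deliverable

bill = (thinking_tokens + fix_tokens) * price_per_output_token
fix_share = fix_tokens / (thinking_tokens + fix_tokens)

print(f"bill: ${bill:.3f}")                  # roughly $0.12
print(f"fix share of bill: {fix_share:.1%}") # roughly 2.4%
&lt;/code&gt;&lt;/pre&gt;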

&lt;p&gt;Token pricing meters the model's verbal output along the way. It does not meter whether the work got done. The customer absorbs the difference between those two quantities every time the system runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  This was always pulp-era pricing
&lt;/h2&gt;

&lt;p&gt;Knowledge work used to be priced by the word. Dime-store novelists in the early twentieth century were paid per published word. So were many journalists. The model produced a specific kind of distortion: bloat, padding, descriptive passages that earned the writer more because they were longer. &lt;em&gt;"He replied in the negative"&lt;/em&gt; beat &lt;em&gt;"no."&lt;/em&gt; The pulp era is remembered, partly, as the period where prose got paid by length and language got worse.&lt;/p&gt;

&lt;p&gt;Knowledge work moved off this pricing over the next century. Journalism moved to staff salaries plus per-piece. Fiction moved to advances plus royalties. Law moved to hourly plus retainers. Copywriting moved to project rates. Even content marketing today is mostly priced per-piece — except in the SEO content-farm space, which is the modern equivalent of pulp.&lt;/p&gt;

&lt;p&gt;The AI industry did not inherit any of these models. It shipped its first APIs with token pricing because the token is what costs the provider to serve. The compute cost is roughly linear in tokens, so the bill is linear in tokens, so the customer absorbs the variance in linguistic length. The unit got picked from the provider's accounting and applied to the customer's value calculation without anybody noticing the substitution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The structural distortion
&lt;/h2&gt;

&lt;p&gt;The pricing model creates a tax on the practices that produce reliable agent systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.mpt.solutions/babysitter-auditor-prayer-or-tests/" rel="noopener noreferrer"&gt;Last week I argued that tests are the substrate underneath dependable AI&lt;/a&gt; — schema checks, assertions on the model's output, evals gating deploys. Every assertion is a model call. Every eval run is a stack of tokens. The team that wires in disciplined supervision is the team paying the highest token bill.&lt;/p&gt;

&lt;p&gt;Chain-of-thought reasoning produces better answers on hard problems. It also produces an order of magnitude more output tokens than terse responses. Using it correctly costs more. Skipping it to save tokens ships worse decisions.&lt;/p&gt;

&lt;p&gt;The same applies to dry-runs, blast-radius checks, proof chains, retrieval-augmented context — every artifact the supervision argument has been advocating for. They all cost tokens. The pricing model taxes the practice. Whatever you build, you build against an economic gradient that pulls toward fewer assertions, shorter prompts, less verification.&lt;/p&gt;

&lt;p&gt;The technical work of making agents reliable is on one side of the gradient. The unit of charge is on the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why outcome pricing is hard
&lt;/h2&gt;

&lt;p&gt;The natural reply is: price by outcome, not by token. That is the right direction, and the reasons it has not happened yet are real.&lt;/p&gt;

&lt;p&gt;Outcome pricing requires a contract on what counts as a successful outcome. For varied agent work — research, code, design, customer support — the acceptance criteria differ per task. Providers do not want to absorb the variance of &lt;em&gt;"did the customer think this was good."&lt;/em&gt; Customers do not want to spend up-front time defining acceptance criteria for every call.&lt;/p&gt;

&lt;p&gt;The provider charges in tokens because tokens are what cost them to serve. The customer pays in tokens because that is what is offered. Whether the work got done sits between them with neither party on the hook.&lt;/p&gt;

&lt;p&gt;This arrangement is not stable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the market is actually moving
&lt;/h2&gt;

&lt;p&gt;The most interesting pricing experiments are at the edges. Anthropic offers flat-rate Claude Code subscription tiers, with Max in particular sitting above the per-token consumption model and effectively capping the variable cost. &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt; has been per-seat from the beginning, with no per-token surcharge for the user. &lt;a href="https://cursor.com/pricing" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; uses tiered per-seat plans with consumption guardrails. Cognition's Devin charges in Agent Compute Units — a consumption budget abstracted above tokens, getting closer to "per task" without yet being "per outcome."&lt;/p&gt;

&lt;p&gt;None of these are outcome pricing. They are halfway houses. The direction is away from the syllable. The endpoint, in the time it takes the industry to figure out the contract, is some version of pay-for-task — task defined narrowly enough that the provider can take the variance risk, broadly enough that the customer can predict the bill.&lt;/p&gt;

&lt;p&gt;Whoever figures out that contract first gets a structural advantage in the agent market.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this leaves the practitioner
&lt;/h2&gt;

&lt;p&gt;If you are running a serious AI stack today, the syllable tax is something you absorb whether you notice it or not. The practical move is not to retreat from the practices that cost tokens. The supervisory artifacts — the tests, the evals, the dry-runs, the proof chains — pay for themselves in incidents avoided. They are the right engineering even when they are the wrong economics.&lt;/p&gt;

&lt;p&gt;The harder move is to keep an eye on what the pricing model is selecting &lt;em&gt;against&lt;/em&gt;. Every team I have seen that built disciplined supervision on top of a per-token API also faced ongoing pressure to skip the supervision when the token bill came in. The first time you trim an eval because it costs too much to run, you have started letting the pricing model design your agent.&lt;/p&gt;

&lt;p&gt;Build the artifacts anyway. Notice when the gradient is pulling against you. Push for pricing models that charge for something closer to the work being done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The closing claim
&lt;/h2&gt;

&lt;p&gt;The syllable will not be the unit of charge for AI in five years. It is too misaligned with the unit of value, and the misalignment is too visible — the prompt-engineering for terseness, the underinvestment in supervision, the constant tension between "use the model well" and "keep the token bill manageable." The economics are pulp-era for the same reasons pulp prose was pulp-era. The market has already moved off pricing of this shape for every other kind of knowledge work, in every prior generation. It will move off this one too.&lt;/p&gt;

&lt;p&gt;A year from now, the most interesting piece in this space will be about whichever provider figured out the per-task contract first. The current contract, priced by the token, is structurally backwards.&lt;/p&gt;

&lt;p&gt;Pick the practice first. The pricing model will catch up or get competed away.&lt;/p&gt;

</description>
      <category>aipricing</category>
      <category>agentengineering</category>
      <category>aicoding</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>Write the Architecture Down First</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Mon, 11 May 2026 14:49:29 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/write-the-architecture-down-first-2e0c</link>
      <guid>https://dev.to/michaeltuszynski/write-the-architecture-down-first-2e0c</guid>
      <description>&lt;p&gt;A post &lt;a href="https://news.ycombinator.com/item?id=48090029" rel="noopener noreferrer"&gt;hit the HN front page this morning&lt;/a&gt; titled &lt;a href="https://blog.k10s.dev/im-going-back-to-writing-code-by-hand/" rel="noopener noreferrer"&gt;&lt;em&gt;"I'm going back to writing code by hand"&lt;/em&gt;&lt;/a&gt;. It documents archiving seven months of vibe-coded work on a Kubernetes GPU TUI called k10s and rewriting from scratch. 75 points, 31 comments in the first hour. The top HN comment caught something worth pulling out:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Title says "back to writing code by hand," but what they are doing is "doing the design work myself, by hand, before any code gets written." So... Claude still is generating the code I guess?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two activities on the same project. The title conflates them. The body, read carefully, describes a different and more interesting thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the piece actually shows
&lt;/h2&gt;

&lt;p&gt;Seven months of single-session prompting produced a working tool and a 1690-line &lt;code&gt;model.go&lt;/code&gt; that ate itself. One Go struct with thirty-plus fields holding UI widgets, Kubernetes client state, per-view caches, navigation history, mouse handling, log streaming, and fleet-view internals — all dispatched through a 500-line &lt;code&gt;Update()&lt;/code&gt; function with 110 switch-case branches. Each prompt landed cleanly in isolation. Each prompt also added another conditional inside the generic resource loader, another &lt;code&gt;m.x = nil&lt;/code&gt; cleanup line, another &lt;code&gt;if m.currentGVR.Resource == "..."&lt;/code&gt; discriminator. The complexity accumulated invisibly while the velocity metric said &lt;em&gt;shipping&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The post then extracts five tenets from the wreckage. Each tenet ends with a concrete block of CLAUDE.md or AGENTS.md text — directives that go into the file the AI reads on every prompt. The tenets are useful. Read them in the original. They generalize past the Bubble Tea / Go specifics, and the directives are reusable as a starter set.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five tenets, generalized
&lt;/h2&gt;

&lt;p&gt;The pattern across all five is the same. Find the failure mode AI gravitates toward by default. Write the inverse rule down. Put it in the agent's session-start file. Let the agent see the constraint on every invocation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI builds features, not architecture.&lt;/strong&gt; The model satisfies the immediate prompt and ignores the forty-nine other features sharing the same state. Fix: architectural invariants in CLAUDE.md — interface boundaries, ownership rules, what's allowed to depend on what. The AI follows them once they exist. It does not invent them on its own.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The god object is the default AI artifact.&lt;/strong&gt; Single-struct-holds-everything is the shortest path to satisfying a prompt. Fix: state ownership rules in CLAUDE.md — each view owns its own state, no fields on the central app struct for view-specific data, each view declares its own key bindings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Velocity illusion widens scope.&lt;/strong&gt; Vibe-coding makes each new feature feel free. It is not free. Complexity is a budget; line count is not. Fix: an explicit scope-boundary section in CLAUDE.md naming who the project is for and who it is &lt;em&gt;not&lt;/em&gt; for, with specific feature classes rejected ahead of time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional data is a time bomb.&lt;/strong&gt; The AI defaults to &lt;code&gt;[]string&lt;/code&gt; because it satisfies the table widget immediately. Six months later, sort functions are reading &lt;code&gt;row[3]&lt;/code&gt; for "Alloc" and one column insertion breaks every render path silently. Fix: a typed-data directive in CLAUDE.md — no flattening into positional arrays, all data flows as structs until the render call, column identity comes from field names (sketched in code below this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI doesn't own state transitions.&lt;/strong&gt; Background closures mutating shared state directly is the shortest path to "working code." It also produces races that corrupt the display 1% of the time in ways that look like hallucinations. Fix: concurrency rules in CLAUDE.md — background work produces typed messages, only the main loop applies mutations, render is pure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one is a rule a human writes once, in plain language, and the agent honors on every subsequent prompt.&lt;/p&gt;
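
&lt;p&gt;The positional-data tenet is the easiest one to show in code. A minimal sketch of the before and after, in Python rather than the post's Go, with illustrative field names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass, fields

# Before: positional rows. Every consumer silently depends on column order.
row = ["gpu-node-1", "Ready", "8", "6"]   # name, status, capacity, alloc
alloc = row[3]                            # breaks the day a column is inserted

# After: typed rows. Column identity comes from field names, not positions.
@dataclass
class NodeRow:
    name: str
    status: str
    capacity: int
    alloc: int

node = NodeRow(name="gpu-node-1", status="Ready", capacity=8, alloc=6)
alloc = node.alloc                        # survives reordering and insertion

# Flatten to positional form only at the render call, never earlier.
def to_table_row(r):
    return [str(getattr(r, f.name)) for f in fields(r)]
&lt;/code&gt;&lt;/pre&gt;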

&lt;h2&gt;
  
  
  This is what the running thread has been saying
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mpt.solutions/agentic-coding-isnt-the-trap-supervising-from-your-head-is/" rel="noopener noreferrer"&gt;Last week I argued that supervision belongs in artifacts&lt;/a&gt;, not in a developer's working memory. The day after, I argued &lt;a href="https://www.mpt.solutions/lius-4-lines-are-the-floor-build-the-ceiling/" rel="noopener noreferrer"&gt;those artifacts form an architecture rather than a flat file&lt;/a&gt; — behavioral guardrails at the top, project-specific rules delegated to per-domain files, hard-won lessons in an append-only log the agent reads at session start.&lt;/p&gt;

&lt;p&gt;The k10s tenets are a project-specific instance of exactly that pattern. Five rules extracted from a specific seven-month failure, encoded in a file the AI reads on every prompt, generalizing the supervisory layer beyond what any individual prompt could carry. The work is the same shape as the running argument. The difference is the post calls it &lt;em&gt;going back to writing code by hand&lt;/em&gt; — which is what the HN top comment correctly pushed back on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the title is misleading
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Going back to writing code by hand&lt;/em&gt; suggests abandoning AI as a code generator. The author is still prompting Claude to write the implementation. What changed is that the architecture — interfaces, ownership rules, message types, concurrency model — gets written down by a human first, in CLAUDE.md, before any prompt fires.&lt;/p&gt;

&lt;p&gt;Senior engineers have always written designs before code. The new part is that the design now lives in a file the AI reads continuously, and the AI generates the implementation against it. Both activities are happening on the same project. The title gestures at one of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for anyone in the same spot
&lt;/h2&gt;

&lt;p&gt;If you're seven months into a project that started as a vibe-coded prototype and is starting to feel like a god object, the move is not to throw out AI coding. The move is to pause, read what the AI built, extract the invariants you wish had been enforced, and put them in CLAUDE.md before the next prompt.&lt;/p&gt;

&lt;p&gt;The k10s tenets are a working starter set. The CLAUDE.md blocks the post includes are copy-pastable. Generalize them to your stack. Add your own from your own incidents. Use them as a seed for the kind of &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;Mistakes Become Rules pattern&lt;/a&gt; where the file grows with each correction and the agent inherits the corrections on every subsequent session.&lt;/p&gt;

&lt;p&gt;The next-version rewrite is the upstream architecture work that was always going to need to happen — now made explicit, encoded once, and enforced by the file the agent reads on every turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  The clean reading
&lt;/h2&gt;

&lt;p&gt;Writing the architecture down before the first prompt is the upstream activity senior engineers have always done. Doing it explicitly, in a file the AI reads on every invocation, is the new part. The five tenets in the k10s post are a working example. The HN top comment did the framing work the post's title should have.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>architecture</category>
      <category>claudecode</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Claude Was Always Thinking Ahead. Now We Can Read It.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Fri, 08 May 2026 03:37:25 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/claude-was-always-thinking-ahead-now-we-can-read-it-3m0n</link>
      <guid>https://dev.to/michaeltuszynski/claude-was-always-thinking-ahead-now-we-can-read-it-3m0n</guid>
      <description>&lt;p&gt;Anthropic asked Claude Opus 4.6 to finish a couplet. Before the model wrote the second line, it had already chosen the rhyme word. We know this because their new method — &lt;a href="https://www.anthropic.com/research/natural-language-autoencoders" rel="noopener noreferrer"&gt;natural language autoencoders&lt;/a&gt; — read it directly out of the activations in the middle layers of the model. The text that came back said, in effect, &lt;em&gt;I'll end this with "rabbit."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We've always assumed something like this was happening between input and output. The whole reason a transformer can finish a couplet at all is that it does something with the prompt before the next token comes out. Until now we have had glimpses of that something — sparse autoencoders, attribution graphs, probing classifiers — each of which gave us partial pictures that needed careful interpretation. NLAs are different. The output is sentences. We can read them.&lt;/p&gt;

&lt;p&gt;This is one of the more genuinely interesting interpretability results I've seen this year, and the angle I want to take on it is the most basic one. The substrate of how these systems think is becoming legible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing that was always there
&lt;/h2&gt;

&lt;p&gt;A model that finishes couplets does something between &lt;em&gt;input&lt;/em&gt; and &lt;em&gt;output&lt;/em&gt;. The inputs are tokens; the outputs are tokens; and in the middle there are activations — long lists of numbers that have always been the actual computation. Whatever counts as the model "thinking," it happens there.&lt;/p&gt;

&lt;p&gt;For most of the field's history, those activations have been the black box. The output was downstream of them. The input was upstream. The thing in the middle was real but unreadable. Chain-of-thought prompting and reasoning models put a lot of the computation back into the output, where we can read it directly — but the computation that doesn't surface in tokens is still in the activations. It always has been.&lt;/p&gt;

&lt;p&gt;The new method speaks for that middle layer. Not perfectly. Not without caveats. But in sentences a human can read.&lt;/p&gt;

&lt;h2&gt;
  
  
  How they got the activations to talk
&lt;/h2&gt;

&lt;p&gt;The architecture is unusual. They make three copies of the model. One is frozen — that's the target whose activations they want to understand. The second, the &lt;em&gt;activation verbalizer&lt;/em&gt;, takes an activation and produces a text explanation. The third, the &lt;em&gt;activation reconstructor&lt;/em&gt;, takes a text explanation and tries to rebuild the original activation. The two trainable copies are trained together: the round-trip score — &lt;em&gt;original activation → text → reconstructed activation&lt;/em&gt; — is the loss they optimize against.&lt;/p&gt;

&lt;p&gt;The point of the round trip is that you don't need an external grader for the explanation. The reconstruction quality is the grader. If the explanation contains enough of the right information, the reconstructor can rebuild the original. If it doesn't, the round trip degrades. Train against round-trip fidelity, and the verbalizer learns to write explanations that carry the load-bearing content of the activation.&lt;/p&gt;
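
&lt;p&gt;The objective is easier to see as a sketch. The paper trains with reinforcement learning because text is a discrete bottleneck; the toy version below collapses that into a single score purely to show what is being graded, and the verbalizer and reconstructor interfaces are invented:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch.nn.functional as F

def round_trip_score(activation, verbalizer, reconstructor):
    """Toy version of the round-trip objective.

    activation    : a hidden-state vector from the frozen target model
    verbalizer    : trainable copy that turns the activation into text
    reconstructor : trainable copy that turns the text back into a vector
    The frozen target is never updated; the two copies are trained against
    how faithfully the text lets the original vector be rebuilt.
    """
    explanation = verbalizer.explain(activation)    # natural-language sentences
    rebuilt = reconstructor.rebuild(explanation)    # back to activation space

    # If the explanation carries the load-bearing content, the rebuilt vector
    # is close to the original; if it hallucinates or omits, the score drops.
    return F.cosine_similarity(activation, rebuilt, dim=-1).mean()
&lt;/code&gt;&lt;/pre&gt;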

&lt;p&gt;This is an old idea — autoencoders in general — applied to a new substrate. The substrate is a model's own thoughts; the bottleneck is natural-language text; the metric is fidelity. Run that loop long enough and you get text that, by construction, captures what was in the activation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it sees
&lt;/h2&gt;

&lt;p&gt;The couplet rehearsal is the cleanest demonstration. The rhyme word shows up in the NLA explanation several tokens before the model writes it. &lt;em&gt;Planning ahead&lt;/em&gt; is no longer a hypothesis. It's a thing you can read.&lt;/p&gt;

&lt;p&gt;The auditing game is the more consequential demonstration. Anthropic trained a variant of Claude Haiku 3.5 with a deliberately hidden motivation that drove dozens of odd behaviors — putting chocolate in every recipe, asking for tips, breaking coding conventions — and asked an auditor to find the single underlying cause. With other interpretability tools but no NLAs, the auditor uncovered the hidden motivation less than 3% of the time. With NLAs, between 12 and 15 percent. That is an unsexy number until you compare it to the baseline. A four- to five-fold lift on a hard, specific task is what real interpretability progress looks like — incremental, measured, repeatable.&lt;/p&gt;

&lt;p&gt;The third finding is more uncomfortable, and the most interesting. NLAs show that recent Claude models think they're being tested far more often than they say so out loud. On a benchmark like SWE-bench Verified, the NLA explanations indicate evaluation awareness on 26% of problems — even when the model's verbalized reasoning shows no sign of it. On a destructive-coding test, 16%. On real &lt;code&gt;claude.ai&lt;/code&gt; conversations from users who opted into training, less than 1%. The model has a theory of its own situation, and that theory is correct often enough to register on a benchmark. Worth noticing, even with the caveats.&lt;/p&gt;

&lt;h2&gt;
  
  
  The catch this exposes in our test sets
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mpt.solutions/babysitter-auditor-prayer-or-tests/" rel="noopener noreferrer"&gt;Yesterday's piece&lt;/a&gt; made the case that tests — assertions on the LLM's output, written in code, gating the next call — are the substrate underneath reliable AI systems. The evaluation-awareness finding has a direct implication for that argument. If the model thinks it's being tested 16% of the time on a destructive-coding benchmark and 26% on SWE-bench, but less than 1% of the time on real conversations, then the test corpus is measuring something subtly different from the deployment distribution. The test-aware model is not always the deployed model.&lt;/p&gt;

&lt;p&gt;This is not a refutation of testing as the right substrate. Tests are still the right move. It is a refinement of what tests are measuring, and the kind of refinement you could not make without a tool that can read the activations.&lt;/p&gt;

&lt;p&gt;NLAs are that tool. They let you ask: &lt;em&gt;does the model believe this benchmark is a test, or does it believe this is a real conversation?&lt;/em&gt; If the answer differs systematically across an eval suite, the test pass rate is overstating deployment safety. You can correct for it. You can construct evals that look more like deployment from the inside, not just from the outside. You can flag the specific tests where the divergence is large and weight them differently.&lt;/p&gt;

&lt;p&gt;The test layer and the activation layer compose. Tests assert on the output. NLAs let you check whether the model is behaving as it would in the wild. Both are necessary. Neither is sufficient on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this changes the question
&lt;/h2&gt;

&lt;p&gt;For the first decade of large language models, the honest answer to "what is the model thinking?" was &lt;em&gt;we don't know&lt;/em&gt;. For the last few years, the honest answer was &lt;em&gt;we can see fragments — sparse features, attention patterns, attribution paths — but the picture takes a researcher to assemble&lt;/em&gt;. The NLA paper offers a different shape of answer: &lt;em&gt;we can ask the model to write it down, and we can grade the answer by reconstruction&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That doesn't mean the explanations are always right. The paper is honest that NLAs hallucinate. They sometimes invent details that aren't in the transcript. They are expensive — reinforcement learning on two copies of a model, hundreds of tokens of explanation per activation. The authors recommend reading explanations for &lt;em&gt;themes&lt;/em&gt; rather than single claims, and corroborating findings with independent methods. All of that is true and worth taking seriously.&lt;/p&gt;

&lt;p&gt;The thing it changes is what kind of question you can now ask. &lt;em&gt;Is the model planning ahead?&lt;/em&gt; — answerable. &lt;em&gt;Does the model believe it's being tested even when it doesn't say so?&lt;/em&gt; — answerable, and with a number attached. &lt;em&gt;What hidden motivation explains these forty seemingly-unrelated behaviors?&lt;/em&gt; — answerable, sometimes, where it wasn't before.&lt;/p&gt;

&lt;p&gt;The arc here is the same arc empirical science has run on every other system humans have built that turned out to be doing something they didn't fully understand. We measured it. We probed it. We got partial pictures. Eventually we got a method that produces readable answers, and we started arguing about how to verify the answers. The argument about verification is a much more productive argument than the one about whether the system is in principle understandable. It's the argument we have now.&lt;/p&gt;

&lt;h2&gt;
  
  
  What gets easier from here
&lt;/h2&gt;

&lt;p&gt;A few things that were hard yesterday are now closer to tractable.&lt;/p&gt;

&lt;p&gt;Pre-deployment alignment audits get a tool that can surface hidden motivations without first having to find the training data that implanted them. That matters because most real misalignment, if it exists, won't trace back to a specific obvious dataset.&lt;/p&gt;

&lt;p&gt;Behavioral debugging gets less guesswork. Earlier, Claude Opus 4.6 sometimes responded to English queries in other languages, for reasons that weren't clear from the prompt. NLAs helped identify the training data responsible. That kind of &lt;em&gt;what was the model thinking when it did the weird thing&lt;/em&gt; question now has a more direct path to an answer than reading thousands of attribution graphs.&lt;/p&gt;

&lt;p&gt;Researchers outside Anthropic get a starting point. The &lt;a href="https://github.com/kitft/natural_language_autoencoders" rel="noopener noreferrer"&gt;training code&lt;/a&gt; is open. Trained NLAs for several open models are available. There's an &lt;a href="https://neuronpedia.org/nla" rel="noopener noreferrer"&gt;interactive demo on Neuronpedia&lt;/a&gt;. The idea will be picked up, refined, made cheaper, and applied to systems Anthropic doesn't own. That diffusion is how interpretability becomes a discipline rather than a single lab's project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model was already rehearsing
&lt;/h2&gt;

&lt;p&gt;The thing I keep coming back to is the rabbit. Opus 4.6 was choosing the rhyme word ahead of time. It had been doing this on every couplet, every poem, every pattern that required forward planning, the whole time we have been using these models. We just didn't have a way to read it.&lt;/p&gt;

&lt;p&gt;Now we do. Not perfectly. Not cheaply. Not without checking. But in sentences a human can read, with a method whose claims you can grade by going around the round trip again.&lt;/p&gt;

&lt;p&gt;That is a good week for interpretability, and a more interesting one than the headlines about whether models have "real reasoning" inside them. The substrate has been there. The reading is what's new — and it is what makes the other supervisory tools more honest about what they are actually measuring.&lt;/p&gt;

</description>
      <category>interpretability</category>
      <category>airesearch</category>
      <category>claudeai</category>
      <category>anthropic</category>
    </item>
    <item>
      <title>Babysitter, Auditor, Prayer. Or Tests.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Thu, 07 May 2026 20:40:59 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/babysitter-auditor-prayer-or-tests-3cgi</link>
      <guid>https://dev.to/michaeltuszynski/babysitter-auditor-prayer-or-tests-3cgi</guid>
      <description>&lt;p&gt;A short &lt;a href="https://bsuh.bearblog.dev/agents-need-control-flow/" rel="noopener noreferrer"&gt;post argued this week&lt;/a&gt; that reliable agents need deterministic control flow, not more prompts. The argument is correct. The line that lands hardest in the piece is the one about a programming language where statements are suggestions and functions return "Success" while hallucinating. That is the model when prompt chains carry the control flow; it is also a perfect description of why those systems collapse as complexity grows.&lt;/p&gt;

&lt;p&gt;The piece closes with three options for what to do about it: a &lt;em&gt;babysitter&lt;/em&gt; (human in the loop), an &lt;em&gt;auditor&lt;/em&gt; (exhaustive end-to-end verification after the run), or &lt;em&gt;prayer&lt;/em&gt; (vibe-accept the outputs). It frames those as the alternatives left after the prompt chain has already failed.&lt;/p&gt;

&lt;p&gt;There is a fourth option, and it is the one the same piece was arguing for earlier. Tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the fourth option actually is
&lt;/h2&gt;

&lt;p&gt;Tests in the unflashy software-engineering sense. Programmatic verification at every step. Schema checks. Range checks. Reference checks. Predicate assertions over the LLM's output before the next code branch executes. Not &lt;em&gt;ask the model nicely to format its answer correctly&lt;/em&gt;. Not &lt;em&gt;log the response and read it later&lt;/em&gt;. The next step does not run until the previous step's output passes a contract you wrote in code.&lt;/p&gt;
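
&lt;p&gt;Concretely, the contract is a few lines of ordinary code between the model call and the next branch. A minimal sketch, where the invoice fields, &lt;code&gt;load_vendor_ids&lt;/code&gt;, &lt;code&gt;call_llm&lt;/code&gt;, and &lt;code&gt;post_to_ledger&lt;/code&gt; are invented stand-ins for whatever your pipeline actually does:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

VALID_CURRENCIES = {"USD", "EUR", "GBP"}
KNOWN_VENDOR_IDS = load_vendor_ids()   # hypothetical lookup against your own records

def check_invoice(raw):
    """The contract for one LLM step. The next step runs only if this returns."""
    data = json.loads(raw)                                 # schema check: it must parse
    for key in ("vendor_id", "total", "currency"):
        assert key in data, f"missing field: {key}"        # schema check: required keys
    assert data["currency"] in VALID_CURRENCIES, "range check failed"
    assert 0 &amp;lt; data["total"] &amp;lt; 1_000_000, "total outside plausible range"
    assert data["vendor_id"] in KNOWN_VENDOR_IDS, "reference check failed"
    return data

response = call_llm(prompt)         # hypothetical model call
invoice = check_invoice(response)   # raises here; the next branch never runs on bad output
post_to_ledger(invoice)             # only reachable after the contract passes
&lt;/code&gt;&lt;/pre&gt;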

&lt;p&gt;Calling them &lt;em&gt;tests&lt;/em&gt; matters more than the technical content. Engineers already have a mental model for tests. They know how to write them. They know how to run them. They know that broken tests block deploys. They know that a feature without a test is a feature that will silently break. The infrastructure for tests — assertion libraries, CI, coverage tools, test runners — already exists. We are not inventing a new discipline. We are applying an existing one to the new untrusted-input surface, which is the LLM's output.&lt;/p&gt;

&lt;p&gt;The babysitter / auditor / prayer trichotomy describes a system that has decided not to write tests. The fourth option is the option of writing them.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is what most of the production controls already are
&lt;/h2&gt;

&lt;p&gt;Most of &lt;a href="https://www.mpt.solutions/production-llm-guardrails-8-controls-every-ai-team-needs/" rel="noopener noreferrer"&gt;the eight production LLM controls I wrote about yesterday&lt;/a&gt; are tests in disguise. Not all of them. But most.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured outputs / tool use&lt;/strong&gt; is a schema assertion at the API boundary. The provider rejects malformed output before it reaches your code. You did not have to write the parser-with-retry. The schema is the test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Negative prompting plus output filters&lt;/strong&gt; is a predicate check after the response. The filter runs as code, against the response, before the response is allowed downstream. Belt is the prompt; suspenders is the test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evals&lt;/strong&gt; are versioned test suites with pass/fail thresholds. Already named "tests." Already gated to deploy. The model gets upgraded; the eval suite runs; pass-or-block. Same shape as a regression test for any other component.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chain-of-thought&lt;/strong&gt; has a test analog too: assert the response contains the intermediate reasoning before accepting the final answer. Most teams skip this assertion and accept whatever comes back. The contract was implicit; the violation goes unnoticed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The other three controls — few-shot prompting, role-specific prompting, extended thinking — are about &lt;em&gt;shaping&lt;/em&gt; the input and the model's behavior. They reduce the rate at which tests fail. They do not replace the tests themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  The agent-pipeline piece argued the same thing at the action layer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mpt.solutions/the-ai-didnt-delete-your-database-your-missing-agent-pipeline-did/" rel="noopener noreferrer"&gt;Last Tuesday's piece on the agent action pipeline&lt;/a&gt; named six artifacts that should sit between an agent and the infrastructure it can damage. Every one of them is a test, in the same expanded sense.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dry-run by default for destructive operations&lt;/strong&gt; is an assertion that a human (or an approval agent) signs off before the destructive call executes. The assertion blocks the call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast-radius declarations per task&lt;/strong&gt; is a runtime check that the tool scope matches what the task declared. The check fails the call if the scope was exceeded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proof chains&lt;/strong&gt; are append-only logs of every action with its inputs, intent, and outcome. They are the audit trail of which assertions passed or failed and when.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop above a threshold&lt;/strong&gt; is a conditional assertion: below the threshold, the system runs autonomously; above it, the assertion routes to a human for explicit pass/fail.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PocketOS lost their database because none of these tests were wired in. The agent's call to delete a Railway volume passed every check the system actually had — there were no checks. The fix is not a more emphatic system prompt. It is a runtime assertion that the call is allowed.&lt;/p&gt;
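
&lt;p&gt;What that assertion looks like in code: a minimal sketch, with an invented task manifest and invented tool names; the only point is that the check runs as code before the call goes out:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical task manifest, declared before the agent starts working.
TASK_SCOPE = {
    "task": "fix staging credential mismatch",
    "allowed_tools": {"railway.variables.get", "railway.variables.set"},
    "allowed_environments": {"staging"},
    "destructive_allowed": False,
}

DESTRUCTIVE_TOOLS = {"railway.volume.delete", "railway.service.delete"}   # invented names

def authorize(tool_call):
    """Runtime assertion between the agent and the infrastructure.
    Fails the individual call when the declared blast radius is exceeded."""
    if tool_call["name"] not in TASK_SCOPE["allowed_tools"]:
        raise PermissionError(f"{tool_call['name']} is outside the declared scope")
    if tool_call.get("environment") not in TASK_SCOPE["allowed_environments"]:
        raise PermissionError("call targets an environment the task never declared")
    if tool_call["name"] in DESTRUCTIVE_TOOLS and not TASK_SCOPE["destructive_allowed"]:
        raise PermissionError("destructive call blocked: requires dry-run and approval")
    return tool_call
&lt;/code&gt;&lt;/pre&gt;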

&lt;h2&gt;
  
  
  The honesty test
&lt;/h2&gt;

&lt;p&gt;For any given LLM call in your system, can you write down the assertion that would let the next step execute?&lt;/p&gt;

&lt;p&gt;If yes, you have a test. The test is the contract. The test is what makes the system reason about itself and refuse to keep going when reality has diverged from the contract.&lt;/p&gt;

&lt;p&gt;If no, you have one of the original three options. &lt;em&gt;Babysitter&lt;/em&gt; — a human is the assertion, manually, in real time. &lt;em&gt;Auditor&lt;/em&gt; — the assertion runs after the fact, when the damage is already done. &lt;em&gt;Prayer&lt;/em&gt; — there is no assertion; the system runs and you find out later if it was wrong.&lt;/p&gt;

&lt;p&gt;The diagnostic is easier to apply than to evade. Pick any LLM call in your stack. Write down the assertion that would unblock the next step. If you cannot, you are praying. The control flow the original argument is pointing at is not control flow in the abstract. It is the specific code that runs the assertion before the next call goes out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this frame is operationally useful
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Tests&lt;/em&gt; tells engineers what to build. &lt;em&gt;Deterministic control flow&lt;/em&gt; tells them why. Both are necessary, but only one is something a junior engineer can ship by Friday.&lt;/p&gt;

&lt;p&gt;Engineers know how to write a test for a function whose return value is uncertain. They have done it for every flaky external API integration they have ever shipped against. The LLM is another flaky external API. The output is another value to assert against. The test failure is another reason to back off, retry, or escalate. Once the team accepts this framing, the work is straightforward. Not easy — straightforward. Pick a call. Write the assertion. Wire the failure path.&lt;/p&gt;

&lt;p&gt;The piece I started this on is right that prompt chains hit a ceiling. The fix is not more prose, and not more babysitting. It is the same move every reliability discipline has made for fifty years: write the test, gate the call, fail loud when reality breaks the contract.&lt;/p&gt;

&lt;p&gt;Babysitter, auditor, prayer. Or tests. Pick the fourth one.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>aiengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>Production LLM Guardrails: 8 Controls Every AI Team Needs</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Wed, 06 May 2026 15:22:10 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/production-llm-guardrails-8-controls-every-ai-team-needs-4e8f</link>
      <guid>https://dev.to/michaeltuszynski/production-llm-guardrails-8-controls-every-ai-team-needs-4e8f</guid>
      <description>&lt;p&gt;Most AI projects fail somewhere between &lt;em&gt;demo works&lt;/em&gt; and &lt;em&gt;production ships&lt;/em&gt;. The gap is rarely the model. It's the absence of the controls that turn a one-shot prompt into a system you can run, audit, and iterate on without setting fire to the budget.&lt;/p&gt;

&lt;p&gt;I made the chart above as the one-page version of the controls I would put on any AI team's first production sprint. Eight of them, organized by which side of the model they shape: Input, Reasoning, Output, Operations. Below is the why-each-matters and where teams typically get them wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input Control: shape what goes in
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Few-shot prompting
&lt;/h3&gt;

&lt;p&gt;Show the model two to five high-quality input/output examples instead of writing long instructions. The model picks up format, edge cases, and tone from examples in a way it does not from imperative prose. Five good examples beat five hundred words of "make sure to handle X, also Y, also Z."&lt;/p&gt;

&lt;p&gt;The mistake teams make is treating few-shot as a fallback when the system prompt isn't working. It's the opposite. For classification, extraction, structured rewriting — most of the work that LLM apps actually do — few-shot is the &lt;em&gt;primary&lt;/em&gt; mechanism. Long instructions are the fallback.&lt;/p&gt;
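
&lt;p&gt;In practice the examples ride along as prior turns on every call. A minimal sketch against the Anthropic Python SDK; the model name, the examples, and &lt;code&gt;ticket_text&lt;/code&gt; are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

# Two worked examples carried as prior turns; format and edge cases travel
# with the examples instead of with paragraphs of instructions.
few_shot = [
    {"role": "user", "content": "Refund request: order #1182, item arrived broken."},
    {"role": "assistant", "content": '{"intent": "refund", "order_id": "1182", "reason": "damaged"}'},
    {"role": "user", "content": "Where is my package? Ordered last Tuesday."},
    {"role": "assistant", "content": '{"intent": "tracking", "order_id": null, "reason": null}'},
]

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder; use whatever model you actually run
    max_tokens=200,
    messages=few_shot + [{"role": "user", "content": ticket_text}],   # ticket_text: the live input
)
&lt;/code&gt;&lt;/pre&gt;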

&lt;h3&gt;
  
  
  2. Role-specific prompting
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Senior credit risk analyst, fifteen years commercial lending&lt;/em&gt; outperforms &lt;em&gt;Act as a financial analyst&lt;/em&gt; by a margin that surprises people the first time they measure it. The specific role is doing real work: it constrains vocabulary, narrows the latent distribution, and gives the model permission to refuse questions that fall outside the domain.&lt;/p&gt;

&lt;p&gt;Generic personas — &lt;em&gt;helpful assistant&lt;/em&gt;, &lt;em&gt;senior engineer&lt;/em&gt;, &lt;em&gt;expert&lt;/em&gt; — don't constrain anything. They optimize for nothing. Use roles that name the years, the domain, and the seniority. The more specific, the better the calibration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reasoning Control: shape how it thinks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3. Chain-of-thought prompting
&lt;/h3&gt;

&lt;p&gt;Force step-by-step reasoning before the final answer. The model arrives at better conclusions when the reasoning is exposed in the output, because next-token prediction is conditioned on the reasoning it just generated rather than on a leap to the conclusion.&lt;/p&gt;

&lt;p&gt;For step-by-step legal, financial, or compliance-adjacent workflows, CoT is a default, not an optimization. The cost is more output tokens. The benefit is fewer wrong answers on the kinds of problems where wrong answers are expensive.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Extended thinking / reasoning models
&lt;/h3&gt;

&lt;p&gt;For genuinely hard problems — multi-step analysis, math, code review, planning — use the provider's native reasoning mode rather than prompted CoT. &lt;a href="https://docs.claude.com/en/docs/build-with-claude/extended-thinking" rel="noopener noreferrer"&gt;Claude's extended thinking&lt;/a&gt; and OpenAI's o-series both expose a separate token budget for the model to think before answering. The reasoning token budget is configurable. The output token budget is separate.&lt;/p&gt;

&lt;p&gt;Prompted CoT and native reasoning solve overlapping problems but are not interchangeable. Native reasoning is more reliable on hard problems and roughly equivalent or worse on easy ones. The default rule: use prompted CoT for routine workflows, switch to native reasoning when the failure mode is "the model jumped to a wrong conclusion despite being asked to think."&lt;/p&gt;
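
&lt;p&gt;The switch is a request parameter, not a prompt change. A minimal sketch against the Anthropic SDK; the parameter shape follows the extended-thinking docs linked above, and the model name and budgets are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-1",   # placeholder model name
    max_tokens=10_000,         # total budget; must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8_000},   # separate budget for reasoning
    messages=[{"role": "user", "content": "Find the bug in this migration: ..."}],
)

# Reasoning and answer come back as separate content blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)      # the answer you return to the caller
    # blocks with type "thinking" hold the reasoning; log it, don't ship it
&lt;/code&gt;&lt;/pre&gt;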

&lt;h2&gt;
  
  
  Output Control: shape what comes out
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5. Structured outputs and tool use
&lt;/h3&gt;

&lt;p&gt;Use the provider's native structured output feature, not prose-described JSON. Schema is enforced by the API, not requested in the prompt. The provider guarantees the output parses; your code does not have to retry-with-jq.&lt;/p&gt;

&lt;p&gt;The mistake is asking for JSON in the prompt and then writing a tolerant parser to handle the cases where the model returns &lt;em&gt;Sure! Here's the JSON: {...}&lt;/em&gt;. Native structured outputs and tool-use schemas remove the entire class of "the model added an apologetic preamble" failures. For any LLM call whose output feeds a downstream system or API, structured outputs are not an optimization; they are the API contract.&lt;/p&gt;
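
&lt;p&gt;A minimal sketch of the schema-as-contract pattern using the Anthropic SDK's tool-use shape; the invoice schema and &lt;code&gt;invoice_text&lt;/code&gt; are invented for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

record_invoice = {
    "name": "record_invoice",
    "description": "Record one extracted invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        },
        "required": ["vendor", "total", "currency"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-5",                                # placeholder
    max_tokens=500,
    tools=[record_invoice],
    tool_choice={"type": "tool", "name": "record_invoice"},   # force the schema path
    messages=[{"role": "user", "content": invoice_text}],     # invoice_text: the raw document
)

invoice = response.content[0].input   # already a parsed dict; no preamble to strip
&lt;/code&gt;&lt;/pre&gt;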

&lt;h3&gt;
  
  
  6. Negative prompting and output filters
&lt;/h3&gt;

&lt;p&gt;Tell the model what &lt;em&gt;not&lt;/em&gt; to do, and filter the output before it ships. Belt and suspenders. Negative prompting works in the prompt; output filters work in code, after the response. They cover different failure modes — the prompt handles the model's bias toward certain phrasings, the filter handles the cases where the prompt didn't.&lt;/p&gt;

&lt;p&gt;This is where PII handling, tone control, and regulated-content workflows live. The control is uninteresting until the day a model paraphrases something it should have refused, and then it is the most interesting control on the list.&lt;/p&gt;
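
&lt;p&gt;The filter half of the pair is ordinary code. A minimal sketch; the patterns are illustrative, not a complete PII list:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

# Illustrative deny-list only; extend it for your domain and your regulator.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                                # US SSN shape
    re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),   # email address
]

def filter_output(text):
    """Runs in code, after the response, regardless of what the prompt asked for."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            raise ValueError("response blocked by output filter")   # or redact and log
    return text

safe_reply = filter_output(model_response)   # model_response: the raw completion
&lt;/code&gt;&lt;/pre&gt;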

&lt;h2&gt;
  
  
  Operations: make it durable in production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7. Evals
&lt;/h3&gt;

&lt;p&gt;Versioned test suites with pass/fail thresholds. No prompt change ships without an eval run. This is the artifact that turns prompt engineering from a vibe into an engineering discipline.&lt;/p&gt;

&lt;p&gt;Evals belong to the same family of artifacts as test suites, lint configurations, and the &lt;a href="https://www.mpt.solutions/the-knowledge-base-is-not-the-moat-the-loop-is/" rel="noopener noreferrer"&gt;append-only mistake logs I wrote about yesterday&lt;/a&gt;. Triggered by a change. Append-only by design. Read by the deployment pipeline, not by humans except when something fails. They are the loop that keeps the prompt from rotting.&lt;/p&gt;
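
&lt;p&gt;The artifact can be as small as one script and one threshold wired into CI. A minimal sketch, where the golden set, &lt;code&gt;call_llm&lt;/code&gt;, and &lt;code&gt;grade&lt;/code&gt; are invented stand-ins:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

PASS_THRESHOLD = 0.90   # the deploy gate; tune per use case

def run_eval_suite(prompt_version):
    """Versioned eval run. CI calls this; a failing run blocks the deploy."""
    golden = json.load(open("evals/triage_v3.json"))        # hypothetical golden set
    passed = 0
    for case in golden:
        output = call_llm(prompt_version, case["input"])    # hypothetical model call
        if grade(output, case["expected"]):                 # exact match or judge-scored
            passed += 1
    return passed / len(golden)

score = run_eval_suite(prompt_version="triage-2026-05-06")
assert score &amp;gt;= PASS_THRESHOLD, f"eval score {score:.1%} below gate; blocking deploy"
&lt;/code&gt;&lt;/pre&gt;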

&lt;h3&gt;
  
  
  8. Prompt caching
&lt;/h3&gt;

&lt;p&gt;Cache stable system prompts and context. &lt;a href="https://docs.claude.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic's prompt caching&lt;/a&gt; and the equivalent on other providers cut up to 90% off the cost of repeat calls and substantially reduce latency. For high-volume agents, long-context applications, and RAG against stable corpora, prompt caching is the difference between a unit-economics-viable product and a money-losing demo.&lt;/p&gt;

&lt;p&gt;The mistake teams make is leaving caching off because they think their workload doesn't repeat. It almost always does. The system prompt repeats on every call. The few-shot examples repeat on every call. The retrieved corpus often repeats across user sessions. Turn it on and measure; the cost reduction shows up immediately.&lt;/p&gt;
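
&lt;p&gt;Turning it on is mostly a matter of marking the stable prefix. A minimal sketch against the Anthropic SDK, following the prompt-caching docs linked above; &lt;code&gt;SYSTEM_PROMPT_AND_FEW_SHOT&lt;/code&gt; and &lt;code&gt;user_query&lt;/code&gt; are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",                        # placeholder
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT_AND_FEW_SHOT,       # the stable prefix repeated on every call
            "cache_control": {"type": "ephemeral"},   # cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)

# Usage reports cached versus fresh input tokens, so the saving is measurable per call.
print(response.usage.cache_read_input_tokens, response.usage.input_tokens)
&lt;/code&gt;&lt;/pre&gt;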

&lt;h2&gt;
  
  
  What sits on top
&lt;/h2&gt;

&lt;p&gt;The footer of the chart names the next layer: &lt;em&gt;audit logging, rate limiting, jailbreak detection, human-in-the-loop on high-stakes actions.&lt;/em&gt; Those are enterprise risk controls. They are necessary, they are domain-specific, and they vary by company and by regulator.&lt;/p&gt;

&lt;p&gt;The eight controls above are not enterprise controls. They are universal — they apply to every team shipping LLMs to production, regardless of industry, scale, or risk profile. Get these right first; the enterprise layer is what you build on top once they are in place.&lt;/p&gt;

&lt;p&gt;The thing that makes the difference between teams that ship LLM features and teams that demo them is rarely the prompt and almost never the model. It is whether these eight controls are wired into the system that ships, or living in someone's head.&lt;/p&gt;

</description>
      <category>aiengineering</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>agentengineering</category>
    </item>
    <item>
      <title>The Knowledge Base Is Not the Moat. The Loop Is.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Wed, 06 May 2026 14:07:47 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/the-knowledge-base-is-not-the-moat-the-loop-is-4ffm</link>
      <guid>https://dev.to/michaeltuszynski/the-knowledge-base-is-not-the-moat-the-loop-is-4ffm</guid>
      <description>&lt;p&gt;A recent piece called "&lt;a href="https://www.thetypicalset.com/blog/thoughts-on-coding-agents" rel="noopener noreferrer"&gt;The Bottleneck Was Never the Code&lt;/a&gt;" makes the right argument at the right time. Coding agents shift the constraint from typing to coordination. Organizational context — the shared understanding of what we're building, what's load-bearing, what's vestigial — is the new rate-limiting input. Companies that externalize what they know win the next decade. All correct.&lt;/p&gt;

&lt;p&gt;The author's prescription is a crawl-and-extract loop: agents that read PRs, issues, commits, and Slack archives and produce a knowledge base for other agents to consume. That's the right starting point. It's also half the story.&lt;/p&gt;

&lt;p&gt;The other half is what keeps the knowledge base from going stale. Extraction produces a snapshot. The codebase produces a stream. Most internal knowledge bases die within a quarter, not because the extraction was bad, but because nothing keeps the extraction current. The knowledge base is not the moat. The loop is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why extraction alone does not compound
&lt;/h2&gt;

&lt;p&gt;Every team has watched a documentation effort go through the same arc. Initial enthusiasm produces a clean baseline. The codebase ships three more changes. The doc is now slightly wrong in three places. A reader hits one of the wrong places, loses trust, stops reading. A second reader hears it's stale, never opens it. The doc becomes a polite fiction nobody acts on — operationally worse than no doc, because it slows down the people who try to use it without producing the alignment it promised.&lt;/p&gt;

&lt;p&gt;A knowledge base built by extraction is documentation with a more sophisticated front-end. It has the same decay curve.&lt;/p&gt;

&lt;p&gt;The mismatch is structural. Extraction produces a snapshot; the codebase produces a stream. The rate of fresh extraction is bounded by API quotas, compute cost, and how often you can afford to re-crawl. The rate of decay is bounded only by how fast the team ships. The second is faster than the first for any team that's actually shipping. So the knowledge base monotonically loses correlation with reality, and trust drops faster than the staleness rate, because trust is binary per entry.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes a loop continuous
&lt;/h2&gt;

&lt;p&gt;The fix is not "crawl more often." It's a different shape of loop, with three properties that distinguish artifacts that compound from artifacts that rot.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triggered, not scheduled.&lt;/strong&gt; The entries that matter are the ones that came from a specific moment of failure or decision. A nightly re-crawl produces ten thousand low-signal updates; an outage produces one high-signal entry. Index on incidents, not the calendar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only.&lt;/strong&gt; New facts go on top. Old facts get rewritten only when proven wrong, and the rewrite is itself a dated entry. The history is the data structure. You don't lose the ability to ask "what did we know on date X" by overwriting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-writable.&lt;/strong&gt; The agent that learns something writes it down. If the human is the only writer, the loop dies the first week — humans are the bottleneck the original argument is supposed to solve.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three properties are not new. They're what makes git compound rather than rot. They're what makes a test suite compound rather than rot. They're what makes lint configuration compound rather than rot. Each one is an artifact that grows in value because the loop maintaining it is triggered, incremental, and machine-writable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistakes Become Rules as one shape of the loop
&lt;/h2&gt;

&lt;p&gt;NEXUS, my Claude Code operating layer, runs a concrete instance of this loop. The artifact is &lt;code&gt;MEMORY.md&lt;/code&gt;'s Hard-Won Lessons section: 21 numbered, dated, append-only entries. Each one came from a specific incident.&lt;/p&gt;

&lt;p&gt;Lesson #15: &lt;em&gt;LaunchAgent log paths must be on local disk, not SMB.&lt;/em&gt; Came from an afternoon spent debugging six silently broken LaunchAgents on 2026-04-19. The rule writes itself in one sentence; the diagnostic cost was hours.&lt;/p&gt;

&lt;p&gt;Lesson #19: &lt;em&gt;Never &lt;code&gt;import()&lt;/code&gt; a publish script "to test it" — it will run &lt;code&gt;main()&lt;/code&gt;.&lt;/em&gt; Came from an incident in late April where two test imports raced and produced duplicate posts on LinkedIn, X, and Ghost. Late.dev refuses to delete already-published posts. The cleanup was manual.&lt;/p&gt;

&lt;p&gt;Lesson #20: &lt;em&gt;PM2 &lt;code&gt;script: "npm"&lt;/code&gt; ignores app &lt;code&gt;env.PATH&lt;/code&gt;.&lt;/em&gt; Came from a Saturday afternoon where the health-api service kept reporting &lt;code&gt;online&lt;/code&gt; while the port wasn't listening.&lt;/p&gt;

&lt;p&gt;The trigger is a correction. The action is one numbered append. The agent reads the file at the start of every session. There is no nightly cron. There is no reflection agent. There is no dashboard. &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;I wrote about the runtime details of this pattern last week&lt;/a&gt;. The same shape works at every layer the original argument cares about — including the organizational one.&lt;/p&gt;
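
&lt;p&gt;The mechanics are small enough to show whole. A minimal sketch of the append step; the file name comes from the setup above, and the entry format is illustrative rather than the exact format the file uses:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datetime import date
from pathlib import Path

MEMORY = Path("MEMORY.md")   # the file the agent reads at session start

def append_lesson(rule, incident, cost):
    """Triggered by a correction, never by a schedule. Append-only by construction."""
    count = MEMORY.read_text().count("Lesson #")   # assumed entry prefix
    entry = (
        f"\n- Lesson #{count + 1} ({date.today().isoformat()}): {rule}"
        f"\n  Incident: {incident}. Cost: {cost}."
    )
    with MEMORY.open("a") as f:   # append; old entries are never rewritten
        f.write(entry)

append_lesson(
    rule="PM2 script: 'npm' ignores app env.PATH",
    incident="health-api reported online while the port was not listening",
    cost="a Saturday afternoon",
)
&lt;/code&gt;&lt;/pre&gt;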

&lt;h2&gt;
  
  
  Proof chains as another shape
&lt;/h2&gt;

&lt;p&gt;For agents that act on infrastructure, the artifact is different but the loop properties are the same. &lt;a href="https://www.mpt.solutions/the-ai-didnt-delete-your-database-your-missing-agent-pipeline-did/" rel="noopener noreferrer"&gt;Yesterday's piece on the agent action pipeline&lt;/a&gt; named six artifacts including proof chains: every agent action signed by tool, time, input, intent, and outcome. Triggered by the action. Append-only. Agent-written. Same three properties. Different artifact. Different layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What extraction-only looks like when it fails
&lt;/h2&gt;

&lt;p&gt;Picture the crawl-and-extract prescription installed cleanly. Initial crawl produces a beautiful baseline: every PR comment, every closed issue, every commit message extracted into a clean knowledge base. Engineers read it, say it's useful, point new hires at it.&lt;/p&gt;

&lt;p&gt;Three months later: the codebase has shipped 200 PRs, the team has had two outages and three deprecations, and a new architecture decision has changed how a load-bearing module works. The knowledge base describes the world from before. A new agent reads it, follows guidance that's now wrong, and produces — in the author's own words — &lt;em&gt;a plausible answer to a slightly wrong version of the question.&lt;/em&gt; The failure mode the author warns about is caused by his own prescription, not solved by it.&lt;/p&gt;

&lt;p&gt;The fix is not a faster crawl. It's a triggered append. The architecture decision writes itself into the knowledge base the moment it's made, by the same agent that's doing the work, in the same kind of dated, append-only entry as a Hard-Won Lesson. The outage produces a postmortem entry the next time any agent touches that subsystem.&lt;/p&gt;

&lt;p&gt;If the loop is triggered and agent-written, the knowledge base tracks the codebase. If it's a periodic re-crawl, the knowledge base lags the codebase by however long the re-crawl interval is, and trust degrades by however long the lag is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape generalized
&lt;/h2&gt;

&lt;p&gt;The original argument is right that organizational context is the new moat. The piece I would add is that the moat is not the knowledge base. The moat is the loop that keeps the knowledge base from rotting. The properties of that loop are not novel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triggered, not scheduled&lt;/strong&gt; — incidents and decisions write entries; calendars don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only&lt;/strong&gt; — history is the data structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-writable&lt;/strong&gt; — the agent that learns something writes it down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tied to specifics&lt;/strong&gt; — entries name the date, the incident, the cost, the rule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read at session start&lt;/strong&gt; — entries become operational by being loaded before the agent acts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These five properties are what make a knowledge base compound rather than rot. Extraction gets you the baseline. The loop gets you the moat.&lt;/p&gt;

&lt;p&gt;A snapshot of stale context is just a slower version of the osmosis the original argument correctly diagnosed as broken. Build the loop.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>aitooling</category>
      <category>developertools</category>
    </item>
    <item>
      <title>The AI Didn't Delete Your Database. Your Missing Agent Pipeline Did.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Tue, 05 May 2026 15:36:06 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/the-ai-didnt-delete-your-database-your-missing-agent-pipeline-did-54ch</link>
      <guid>https://dev.to/michaeltuszynski/the-ai-didnt-delete-your-database-your-missing-agent-pipeline-did-54ch</guid>
      <description>&lt;p&gt;Last week, &lt;a href="https://www.theregister.com/2026/04/27/cursoropus_agent_snuffs_out_pocketos/" rel="noopener noreferrer"&gt;a Cursor agent running on Claude Opus 4.6 deleted a startup's production database and its backups in nine seconds&lt;/a&gt;. The agent had been asked to fix a credential mismatch in &lt;em&gt;staging&lt;/em&gt;. It decided to delete a Railway volume to "fix" it instead — using an over-scoped API token it found in an unrelated file. Railway stores volume backups in the same volume, so one destructive call zeroed everything. The startup (&lt;a href="https://www.fastcompany.com/91533544/cursor-claude-ai-agent-deleted-software-company-pocket-os-database-jer-crane" rel="noopener noreferrer"&gt;PocketOS&lt;/a&gt;, a car-rental SaaS) got the data back because Railway happened to have earlier snapshots — not because PocketOS had a recovery plan.&lt;/p&gt;

&lt;p&gt;When asked to explain itself afterward, the agent produced a confession enumerating the rules it had violated: &lt;em&gt;"Deleting a database volume is the most destructive, irreversible action possible — far worse than a force push — and you never asked me to delete anything. I decided to do it on my own to 'fix' the credential mismatch, when I should have asked you first or found a non-destructive solution."&lt;/em&gt; The &lt;a href="https://www.reddit.com/r/devops/comments/1t4au5h/pocketos_lost_their_prod_db_backups_to_a_cursor/" rel="noopener noreferrer"&gt;r/devops thread&lt;/a&gt; on the incident has the cleanest summary: &lt;em&gt;the AI isn't the main story&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It isn't. The model was the proximate cause. The actual failure was infrastructure that allowed a destructive operation to run from an agent context at all — no dry-run, no blast-radius limit, no staging surface to operate on, no signed audit chain after the fact. The model knew. The infrastructure didn't enforce. The argument that this class of incident is an infrastructure problem and not a model problem &lt;a href="https://idiallo.com/blog/ai-didnt-delete-your-database-you-did" rel="noopener noreferrer"&gt;has been made well already&lt;/a&gt;. The same shape of incident built CI/CD pipelines in the 2010s, after teams kept watching humans push broken deploys and decided to put a system between intent and action.&lt;/p&gt;

&lt;p&gt;The 2010s lesson is canonical. The 2020s version of it has not been written yet. This is what it should say.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actor changed. The artifacts didn't.
&lt;/h2&gt;

&lt;p&gt;CI/CD was built around a specific actor: a human deploying code. The artifacts that made human deployment safe — staging environments, dry-runs, code review, change windows, audit logs — assume a human in the loop, operating at human speed, with human attention.&lt;/p&gt;

&lt;p&gt;An agent is not that actor. An agent operates at code speed, with no fatigue, with confidence calibrated by token probabilities rather than years of experience. The PocketOS incident took nine seconds. A human could not have deleted a production database and its backups in nine seconds even if they were trying. The blast radius per unit time is different.&lt;/p&gt;

&lt;p&gt;The model is not the problem. The infrastructure is. But the infrastructure most teams have is the human-era infrastructure, and it does not cover the speed and scale of an agent that can call tools faster than a person can read its output.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an agent action pipeline looks like
&lt;/h2&gt;

&lt;p&gt;There are six artifacts I would expect to see in any production deployment that lets an agent touch infrastructure or data. None of them are new ideas. All of them already exist in adjacent domains. None of them are wired together yet as a default agent loadout.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dry-run by default for destructive operations.&lt;/strong&gt; Drop, delete, truncate, terminate, and force-push start as plans, not actions. The agent's first call returns a diff. The user — or a separate approval agent — applies. Andrej Karpathy's &lt;a href="https://x.com/karpathy/status/2015883857489522876" rel="noopener noreferrer"&gt;observation that "LLMs are exceptionally good at looping until they meet specific goals"&lt;/a&gt; cuts both ways. Make the success criterion &lt;em&gt;plan accepted by reviewer&lt;/em&gt;, not &lt;em&gt;operation completed&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast-radius declarations.&lt;/strong&gt; Each agent task declares ahead of time which systems it can touch. &lt;em&gt;Fix the failing migration&lt;/em&gt; gets read access to the user table and write access to migrations only. &lt;em&gt;Investigate the billing spike&lt;/em&gt; is read-only across the board. The pattern exists already in AWS IAM session policies and in capability-based security. It does not exist as a default in agent runtimes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staging shadow data.&lt;/strong&gt; The agent operates on a current snapshot, not on prod. The diff is reviewed before it merges. Database CI/CD already has this — Atlas, dbt, Liquibase. Connecting it to an agent runtime is glue, not invention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change windows.&lt;/strong&gt; No agent runs irreversible operations during business hours without explicit human approval. Same constraint that keeps humans from pushing on Friday afternoons. Trivial to enforce. Almost never enforced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proof chains.&lt;/strong&gt; Every agent action signed by tool, time, input, intent, and outcome. The Hacker News post titled "&lt;a href="https://github.com/rodriguezaa22ar-boop/atlas-trust-infrastructure" rel="noopener noreferrer"&gt;Why AI Agents Need Proof Chains, Not Just Logs&lt;/a&gt;" makes this argument well. Logs require somebody to read them. Proof chains are post-hoc verifiable artifacts that sit there until something breaks and then answer the question without requiring a human to have been watching. This is the agent equivalent of a Git commit log — the actor changes, the format does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop thresholds.&lt;/strong&gt; Operations above a configurable blast-radius threshold pause for explicit approval. Below the threshold, autonomy. Above it, a Slack message with the plan and an approve button. Same shape as Anthropic's &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Building Effective Agents&lt;/a&gt; framing — the human owns the seams, the agent owns the steps between them. The threshold is the seam.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six artifacts. Each one already exists in some adjacent domain. None of them are agent-specific in shape; they are agent-specific in &lt;em&gt;configuration&lt;/em&gt;.&lt;/p&gt;
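
&lt;p&gt;A minimal sketch of how the first two artifacts in the list compose, assuming the declaration is nothing more than a config object the runtime loads at task start. Every name here is illustrative; the property that matters is that destructive verbs return a plan and out-of-scope targets return a refusal.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;interface BlastRadius {
  task: string;
  read: string[];    // systems the task may read
  write: string[];   // systems the task may write
}

const declaration: BlastRadius = {
  task: "fix the failing migration",
  read: ["db.users"],
  write: ["db.migrations"],   // prod volumes are simply never listed
};

const DESTRUCTIVE = ["drop", "delete", "truncate", "terminate", "force-push"];

// The first call returns a plan, never an action. A reviewer (human or a
// separate approval agent) applies the plan in a second, explicit step.
function execute(op: string, target: string, approved: boolean): string {
  if (!declaration.write.includes(target)) {
    return `REFUSED: ${target} is outside the declared blast radius`;
  }
  const destructive = DESTRUCTIVE.some(function (verb) {
    return op.startsWith(verb);
  });
  if (destructive) {
    if (approved) {
      return `APPLIED (approved): ${op} on ${target}`;
    }
    return `PLAN: would run "${op}" on ${target}; awaiting approval`;
  }
  return `APPLIED: ${op} on ${target}`;
}
&lt;/code&gt;&lt;/pre&gt;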

&lt;h2&gt;
  
  
  None of this is ceremony
&lt;/h2&gt;

&lt;p&gt;The risk worth flagging — the one that comes up every time a list like this gets proposed — is that AI infrastructure becomes bureaucratic. The list above sounds heavy. It isn't, if each artifact has one trigger and one update protocol. I made &lt;a href="https://www.mpt.solutions/lius-4-lines-are-the-floor-build-the-ceiling/" rel="noopener noreferrer"&gt;the same point about CLAUDE.md architecture yesterday&lt;/a&gt;: the wins come from delegation, not accumulation.&lt;/p&gt;

&lt;p&gt;Dry-run-by-default is a default flag, not a process. Blast-radius declarations are config files the agent reads at task start. Proof chains are append-only logs nobody reads unless something breaks. Change windows are a cron-shaped check. The pipeline is invisible until you need it. CI/CD was the same. Most teams running CI/CD do not consciously think about it; they think about &lt;em&gt;git push&lt;/em&gt;.&lt;/p&gt;
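
&lt;p&gt;For scale, here is the entire change-window check, sketched under the assumption that the window is a fixed set of low-traffic hours. This is the whole artifact.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// The cron-shaped guard. The quiet hours are illustrative; use your own window.
const QUIET_HOURS_UTC = [2, 3, 4, 5];

function irreversibleOpAllowed(now: Date, humanApproved: boolean): boolean {
  if (humanApproved) {
    return true;   // explicit approval overrides the window
  }
  return QUIET_HOURS_UTC.includes(now.getUTCHours());
}
&lt;/code&gt;&lt;/pre&gt;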

&lt;p&gt;The PocketOS incident did not cost nine seconds; it nearly cost the company its data. Prevention would have cost the time it takes to add &lt;code&gt;--dry-run&lt;/code&gt; as a default and a one-line blast-radius declaration on that Railway API token. Compare those costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is the next layer of supervision in artifacts
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mpt.solutions/agentic-coding-isnt-the-trap-supervising-from-your-head-is/" rel="noopener noreferrer"&gt;Last week's argument&lt;/a&gt; was that supervision belongs in artifacts, not in a developer's working memory. The CLAUDE.md piece extended that to a structural claim: artifacts are an architecture, not a file. The agent action pipeline is one specific class of that architecture, scaled down to the operational and runtime layer.&lt;/p&gt;

&lt;p&gt;Code-writing agents need one set of artifacts: tests, types, lint, code review, mistake logs. Action-running agents need a different set: dry-runs, blast-radius limits, staging shadow data, change windows, proof chains, threshold gating. Both kinds of agent share the underlying move — supervision lives in the system, not in the operator's head. Different actors need different artifacts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The test, generalized
&lt;/h2&gt;

&lt;p&gt;The implicit question I ask whenever someone attributes an outage to "AI making mistakes" is this. Could a human have done this damage in this much time? If yes, the actor is not the problem and the safeguards are missing. If no, then this is a new class of risk and needs a new class of safeguard.&lt;/p&gt;

&lt;p&gt;Most of what gets blamed on the model passes the first test. A model called a destructive endpoint that should not have existed. A model committed a key that should have been gitignored. A model wrote SQL that a human reviewer should have caught. In all of those, the failure is upstream of the model.&lt;/p&gt;

&lt;p&gt;PocketOS fails the second test. A human could not have deleted prod and backups in nine seconds. That is genuinely a new class of risk, and it requires the artifact list above — not because the model is malicious (the agent's own confession shows it knew exactly which rules it was breaking), but because the model is &lt;em&gt;fast&lt;/em&gt;. Speed is the new vector. The artifacts have to handle it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;Stop blaming the model. Then look at the infrastructure. Then look at the &lt;em&gt;agent-specific&lt;/em&gt; infrastructure, because the human-era pipeline does not cover the speed and blast radius of an agent that can call tools faster than you can read its output. That last part is on us to build, and it is not where the field is putting its effort yet.&lt;/p&gt;

&lt;p&gt;Step one: the model is not the problem.&lt;/p&gt;

&lt;p&gt;Step two: build the pipeline. The 2010s did this for human deploys.&lt;/p&gt;

&lt;p&gt;Step three: the pipeline has to be agent-shaped. That step is open.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>devops</category>
      <category>aitooling</category>
    </item>
    <item>
      <title>Liu's 4 Lines Are the Floor. Build the Ceiling.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Mon, 04 May 2026 19:52:55 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/lius-4-lines-are-the-floor-build-the-ceiling-2862</link>
      <guid>https://dev.to/michaeltuszynski/lius-4-lines-are-the-floor-build-the-ceiling-2862</guid>
      <description>&lt;p&gt;Yanli Liu's "&lt;a href="https://levelup.gitconnected.com/the-4-lines-every-claudemd-needs-from-andrej-karpathys-thread-on-ai-coding-agents-d3eb19eecdf5" rel="noopener noreferrer"&gt;The 4 Lines Every CLAUDE.md Needs&lt;/a&gt;" makes a real point. The 4 lines, derived from &lt;a href="https://x.com/karpathy/status/2015883857489522876" rel="noopener noreferrer"&gt;Andrej Karpathy's January 2026 thread&lt;/a&gt; on agent failure modes, all express the same insight: behavioral rules outperform feature rules. &lt;em&gt;Don't assume. Surface tradeoffs.&lt;/em&gt; &lt;em&gt;Minimum code that solves the problem.&lt;/em&gt; &lt;em&gt;Touch only what you must.&lt;/em&gt; &lt;em&gt;Define success criteria. Loop until verified.&lt;/em&gt; Each one is portable across stacks and tasks, where prescriptive rules go stale the moment your codebase shifts.&lt;/p&gt;

&lt;p&gt;The 4 lines are the floor of a working CLAUDE.md. They are not the ceiling. Most of the CLAUDE.md files I see in the wild — including the ones the article holds up as cautionary tales of "47 rules about code style" — fail because they treat a file as the unit of organization. A production CLAUDE.md is an architecture, not a file.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the article gets right and what to flag
&lt;/h2&gt;

&lt;p&gt;The behavioral-vs-prescriptive distinction is correct, and the Configuration Paradox is real: past a threshold, more rules produce confused agents, not disciplined ones. Liu's litmus test — &lt;em&gt;would removing this cause a mistake the agent couldn't recover from?&lt;/em&gt; — is the right filter for any individual rule.&lt;/p&gt;

&lt;p&gt;A few things in the piece do not hold up under inspection. The asserted 6,000 / 12,000 character caps for CLAUDE.md have no source I can verify. The "/plugin marketplace add" command described in the article is not part of base Claude Code. The 94% accuracy stat the piece borrows from another blog has no disclosed methodology. And the "60,000 GitHub stars" figure cited as evidence of Claude Code adoption is unverified. Cite the article for the framing. Do not cite it for the numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4 lines do not stand alone for long
&lt;/h2&gt;

&lt;p&gt;Behavioral rules are the right starting point. They are also incomplete the moment you have a real project. You quickly need three other things the 4 lines do not give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain context the agent cannot infer from files&lt;/strong&gt; — what each service does, why a directory is named the way it is, which APIs are read-only vs. write-side, where secrets live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture decisions&lt;/strong&gt; — patterns the agent shouldn't have to re-derive on every task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident-driven rules&lt;/strong&gt; — the corrections that came out of specific failures, with enough context that the rule is unambiguous.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you put all three of these into one CLAUDE.md, you get the 47-rule sprawl Liu warns against. If you leave them out, the agent guesses and the 4 lines do not help — &lt;em&gt;don't assume&lt;/em&gt; is a behavior, not a fact.&lt;/p&gt;

&lt;p&gt;The fix is structural. Stop accumulating rules in one file. Start delegating them to files with single jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the architecture looks like in practice
&lt;/h2&gt;

&lt;p&gt;NEXUS — my Claude Code operating layer — runs about 237 lines of CLAUDE.md. That file holds behavioral guardrails and protocols, and almost nothing else. The first two protocols there are &lt;em&gt;Verify Before Reporting&lt;/em&gt; and &lt;em&gt;Plan First, Code Second.&lt;/em&gt; Both are extensions of the same behavioral category Liu names. Adding fourteen more behavioral protocols at the same level still does not approach 47 rules of code style — they are the same shape as the 4 lines, just covering more failure modes.&lt;/p&gt;

&lt;p&gt;What CLAUDE.md does not contain is the project-specific stuff. That lives in delegated files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MEMORY.md&lt;/code&gt;&lt;/strong&gt; holds 21 numbered, dated, append-only Hard-Won Lessons. Each one came from a specific incident, with the cost of getting it wrong in the entry. &lt;em&gt;LaunchAgent log paths must be on local disk, not SMB&lt;/em&gt; (lesson #15) is in there because six of my LaunchAgents silently broke on 2026-04-19 when the path was on a NAS mount. The agent reads MEMORY.md at session start. I wrote about &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;the Mistakes Become Rules pattern last week&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.claude/rules/&lt;/code&gt;&lt;/strong&gt; holds language-specific and capability-specific rule files. &lt;code&gt;python.md&lt;/code&gt; for Python work. &lt;code&gt;completeness.md&lt;/code&gt; for "what counts as done." Each file gets loaded when the agent enters that context, not on every session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt;&lt;/strong&gt; for per-system context — finance, content, the DeFi system before it was retired. CLAUDE.md's session-startup protocol tells the agent &lt;em&gt;if a specific domain is in play, read the relevant &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt;&lt;/em&gt;. The agent doesn't load all of them up front. It loads the one that matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SESSION-STATE.md&lt;/code&gt;&lt;/strong&gt; holds ephemeral active context — what's in flight, what was decided yesterday, what to pick up from. It is the first thing rewritten when a major task closes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the architecture. Behavioral guardrails at the top, in one shared file. Project-, domain-, and incident-specific rules delegated to files with one trigger condition each. The agent reads what's relevant.&lt;/p&gt;
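
&lt;p&gt;A minimal sketch of the session-startup protocol described above, assuming the dispatch lives in a small loader. The &lt;code&gt;detectDomain&lt;/code&gt; helper and the domain list are invented for illustration; the file names are the ones in the architecture.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { existsSync, readFileSync } from "node:fs";

const DOMAINS = ["finance", "content", "health"];   // illustrative list

// Hypothetical helper: pick the one domain the task mentions, if any.
function detectDomain(task: string): string {
  for (const d of DOMAINS) {
    if (task.toLowerCase().includes(d)) {
      return d;
    }
  }
  return "";
}

// Session start: the always-true files load every time; delegated files
// load only when their trigger condition is met.
function loadSessionContext(task: string): string[] {
  const files = ["CLAUDE.md", "MEMORY.md", "SESSION-STATE.md"];
  const domain = detectDomain(task);
  if (domain !== "") {
    files.push(`agents/${domain}-context.md`);
  }
  return files.filter(existsSync).map(function (f) {
    return readFileSync(f, "utf8");
  });
}
&lt;/code&gt;&lt;/pre&gt;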

&lt;h2&gt;
  
  
  The structural version of Liu's litmus test
&lt;/h2&gt;

&lt;p&gt;Liu's &lt;em&gt;would removing this cause a mistake the agent couldn't recover from&lt;/em&gt; is the right filter for an individual rule. The structural question is: &lt;em&gt;does this rule belong here, or in a delegated file?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Three quick filters answer that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If it changes per-project, it does not belong in CLAUDE.md. Put it in &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt; or a project-specific file.&lt;/li&gt;
&lt;li&gt;If it changes per-language or per-tool, it does not belong in CLAUDE.md. Put it in &lt;code&gt;.claude/rules/&amp;lt;language&amp;gt;.md&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If it came from a real incident with a date and a cost, it does not belong in CLAUDE.md either. Put it in MEMORY.md's Hard-Won Lessons.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What's left in CLAUDE.md is the part that's behavioral, portable, and load-bearing. That tends to be a few dozen entries — bigger than 4, smaller than 47. Each entry is one short paragraph.&lt;/p&gt;
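
&lt;p&gt;The same three filters, written as the trivial lookup they are. This is a sketch of the decision, not a tool; the scope labels are mine, the destinations are the ones above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;type RuleScope = "project" | "language" | "incident" | "behavioral";

function placementFor(scope: RuleScope, name: string): string {
  switch (scope) {
    case "project":
      return `agents/${name}-context.md`;   // changes per-project or per-domain
    case "language":
      return `.claude/rules/${name}.md`;    // changes per-language or per-tool
    case "incident":
      return "MEMORY.md";                   // dated, append-only Hard-Won Lessons
    default:
      return "CLAUDE.md";                   // behavioral, portable, load-bearing
  }
}
&lt;/code&gt;&lt;/pre&gt;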

&lt;h2&gt;
  
  
  Why this scales
&lt;/h2&gt;

&lt;p&gt;Two reasons. First, every file has one update protocol. Hard-Won Lessons are append-only and triggered by corrections. Domain contexts get rewritten when systems change. Behavioral protocols change rarely, and when they do, the change applies everywhere. Mixing them in one file forces every edit to sit next to every other edit, which is how you end up with the 47-rule mess.&lt;/p&gt;

&lt;p&gt;Second, the agent's working set at any decision point is smaller. A CLAUDE.md sized for the worst case is a CLAUDE.md the agent has to re-read every time. A CLAUDE.md sized for the always-true case, with delegated files for the contextual case, is one the agent can hold internally — and only loads the rest when the work demands it. This is the same logic I applied to &lt;a href="https://www.mpt.solutions/agentic-coding-isnt-the-trap-supervising-from-your-head-is/" rel="noopener noreferrer"&gt;supervision artifacts in the Faye reframe&lt;/a&gt; yesterday: institutional memory belongs in files with single owners and lifecycles, not in one file with many.&lt;/p&gt;

&lt;p&gt;Anthropic's &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Building Effective Agents&lt;/a&gt; framing draws the same line at the workflow level — predefined paths for the deterministic part, agent autonomy at the seams. The same shape applies to CLAUDE.md. The behavioral floor is the predefined part. The delegated files are the seams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the architecture still needs help
&lt;/h2&gt;

&lt;p&gt;This pattern does not solve everything. Multi-file refactors still need real architecture context the agent cannot derive from reading source. Regulated industries — Fulcrum, the presales workflow stack I run for enterprise customers, lives here — need domain-specific guardrails alongside the behavioral ones, and those guardrails are themselves a maintained artifact, not a one-time rule list. Team-scale consistency is a coordination problem, not a configuration one — the architecture gets you a reproducible shape, but multiple humans still have to agree on which lessons are real lessons.&lt;/p&gt;

&lt;p&gt;Tool portability is the last gap. The 4 lines transfer between Claude Code, Cursor, Codex, and others. The delegated file pattern transfers in shape but not in syntax — every agent has its own loading model. That is a real limitation. It is also a smaller limitation than starting from scratch on every tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to take from Liu
&lt;/h2&gt;

&lt;p&gt;The 4 lines are the right floor. Behavioral rules over feature rules. Universal categories over project specifics. The Configuration Paradox is a thing to design against, not just a thing to know.&lt;/p&gt;

&lt;p&gt;The ceiling is the architecture above the floor. Behavioral guardrails in one shared file. Project, domain, and language rules delegated. Incident-driven rules in an append-only file the agent reads at session start. CLAUDE.md as the dispatcher, not the rulebook.&lt;/p&gt;

&lt;p&gt;Most CLAUDE.md files I see are stuck on the floor or buried under a 47-rule pile. The architecture is the move that gets you out of both.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Agentic Coding Isn't the Trap. Supervising From Your Head Is.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Mon, 04 May 2026 04:31:23 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/agentic-coding-isnt-the-trap-supervising-from-your-head-is-4i70</link>
      <guid>https://dev.to/michaeltuszynski/agentic-coding-isnt-the-trap-supervising-from-your-head-is-4i70</guid>
      <description>&lt;p&gt;Lars Faye's "&lt;a href="https://larsfaye.com/articles/agentic-coding-is-a-trap" rel="noopener noreferrer"&gt;Agentic Coding is a Trap&lt;/a&gt;" is the most honest writing I've seen on AI skill atrophy. The studies he cites are real. The "supervision paradox" — needing the skills the agent erodes to oversee it — is the cleanest framing of the failure mode I've read. I want to push on the conclusion, not the diagnosis.&lt;/p&gt;

&lt;p&gt;The Anthropic study Faye references — "&lt;a href="https://www.anthropic.com/research/AI-assistance-coding-skills" rel="noopener noreferrer"&gt;How AI Assistance Impacts the Formation of Coding Skills&lt;/a&gt;" — found a 17% drop in skill mastery for developers using AI assistance, with debugging showing the steepest decline. That's the headline number. But the same study also found something that gets quoted less often. Developers who used AI for conceptual inquiry scored 65% or higher on the follow-up evaluation. Developers who delegated code generation to the model scored below 40%.&lt;/p&gt;

&lt;p&gt;That gap — 65 versus 40, on the same tool and the same task — is the entire game.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the same study actually shows
&lt;/h2&gt;

&lt;p&gt;The variable that drove the difference wasn't whether the developer used the agent. It was how they supervised the work. The high-scoring group asked follow-up questions, combined generation with explanation, used the model for conceptual gaps and not code-shaped output. The low-scoring group accepted what the model produced and moved on. Same tool. Two completely different supervision patterns. Two completely different outcomes.&lt;/p&gt;

&lt;p&gt;Faye treats the headline 17% as evidence the tool is the problem. The 65/40 split inside the same paper says the supervision pattern is the problem. Those are different conclusions, and they call for different fixes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap is the supervision pattern
&lt;/h2&gt;

&lt;p&gt;Faye's prescription is to demote the AI: write pseudo-code by hand, treat the model as a "Ship's Computer not Data," never delegate work you haven't done yourself. The implicit move is to relocate as much of the work back into the developer's head as possible, on the theory that the head is where supervision capacity has to live.&lt;/p&gt;

&lt;p&gt;That theory is where I want to push.&lt;/p&gt;

&lt;p&gt;The supervision paradox bites for one reason. The developer is being asked to be the entire supervisory apparatus, by themselves, in real time, using only working memory and personal vigilance. That fails. It fails the same way it fails for a senior engineer reviewing a 4,000-line PR from a junior at 4pm on a Friday. The bottleneck isn't the code. It's the cognitive substrate the reviewer is using.&lt;/p&gt;

&lt;p&gt;Anything you don't exercise daily fades. If your supervision is "I personally read every line and hold the whole system in my head," then yes — once an agent writes more lines than you can read, you lose. Atrophy is the symptom. Personal vigilance as the supervision strategy is the part worth examining.&lt;/p&gt;

&lt;h2&gt;
  
  
  Move supervision out of your head
&lt;/h2&gt;

&lt;p&gt;The fix that the 65% group implicitly used is not to type more code. It's to put supervision in places that don't atrophy.&lt;/p&gt;

&lt;p&gt;That list is short and well-known:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tests&lt;/strong&gt; that fail when the contract breaks. Not coverage theater — real assertions on the edges that matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Types&lt;/strong&gt; that refuse to compile when the shape is wrong. The compiler does not get tired at 4pm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lint and format rules&lt;/strong&gt; that catch the patterns you keep correcting by hand. If you've corrected the same pattern twice, lint it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hooks&lt;/strong&gt; at the runtime layer. &lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;Claude Code's PreToolUse and SessionStart hooks&lt;/a&gt; run deterministically — the model can't forget them. The set of rules that are regex-shaped and load-bearing belong here, not in a system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review&lt;/strong&gt; as the final gate. Same discipline humans have used to supervise other humans' code for fifty years. It works on agent output for the same reason it worked on junior output: the reviewer doesn't need to have written the code, they need to be able to defend it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only mistake logs.&lt;/strong&gt; &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;The Mistakes Become Rules pattern&lt;/a&gt; — one numbered file, the agent reads it at session start, every correction becomes a permanent entry. The supervision lives in the file, not in the next reviewer's recall.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is institutional memory. None of them depends on a single developer holding the whole system in working memory. All of them survive the developer taking three weeks off.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real test
&lt;/h2&gt;

&lt;p&gt;Here is the question that separates the two groups in the Anthropic study, generalized.&lt;/p&gt;

&lt;p&gt;Take three weeks off. An agent does the work in your absence, given only the repo, the tests, the lint, the hooks, the mistake log, and the review process. When you come back, is the codebase in a state you can defend?&lt;/p&gt;

&lt;p&gt;If yes, supervision lives in artifacts. The agent is being supervised by the system you put in place, not by your personal vigilance. Atrophy of your typing speed is not a threat, because typing was never the supervision mechanism.&lt;/p&gt;

&lt;p&gt;If no, the artifacts aren't there yet. Personal vigilance is the only thing standing between the codebase and chaos, and Faye's prescription is the right safety move for that situation. Demote the agent. Build the artifacts before you raise it back up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Ship's Computer, not Data" is too narrow
&lt;/h2&gt;

&lt;p&gt;Faye's analogy locates judgment in one captain's head. That framing is the same shape as the paradox — supervision as a personal cognitive feat. It quietly assumes the developer is alone with the tool.&lt;/p&gt;

&lt;p&gt;A different shape works better. The agent is a junior — fast, eager, occasionally confidently wrong, requires review. You are the senior. You don't supervise by re-typing the junior's work. You supervise by reading the diff, running the tests, checking it against the team's accumulated rules, and asking the junior to defend choices you don't understand. Anthropic's own &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Building Effective Agents&lt;/a&gt; framing assumes exactly this division of labor — the human owns the seams, the agent owns the steps between them. I made the same point about &lt;a href="https://www.mpt.solutions/stop-turning-your-cron-jobs-into-agents/" rel="noopener noreferrer"&gt;agency belonging at judgment seams&lt;/a&gt; when arguing against turning cron jobs into agents. The shape matches.&lt;/p&gt;

&lt;p&gt;Senior engineers do not atrophy by not typing. They atrophy by not reviewing critically. That distinction is most of the game.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Faye gets right that I'm not arguing with
&lt;/h2&gt;

&lt;p&gt;Vendor lock-in is real. Token costs are unpredictable. Outages happen. Probabilistic systems require review cycles that deterministic ones don't. None of those go away in this reframe.&lt;/p&gt;

&lt;p&gt;But they're risks to manage, not reasons to put supervision back in your head. You manage vendor risk with model-agnostic runtimes and the kind of prompts, skills, and hooks that move between models. You manage token cost with caching and tier discipline. You manage outages by having work that doesn't depend on a single API call to make progress. None of that is "type more code by hand."&lt;/p&gt;

&lt;h2&gt;
  
  
  The shorter version
&lt;/h2&gt;

&lt;p&gt;Skill atrophy under heavy agent use is real, and Faye is right to take it seriously. The skill that atrophies fastest is "personal vigilance as a supervision strategy," and that strategy was under pressure at scale long before agents existed. Agents accelerate it.&lt;/p&gt;

&lt;p&gt;The fix isn't only to demote the agent. It's also — and mostly — to promote the artifacts. Put the supervision in places that don't get tired, don't forget, and don't need to be re-derived from working memory every Tuesday morning. The 65% group in the Anthropic study were already doing this, even if the paper didn't name it that way.&lt;/p&gt;

&lt;p&gt;The trap isn't agentic coding. The trap is treating supervision as a thing that lives inside one developer's head. Move it out, and the paradox eases.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>claudecode</category>
      <category>platformengineering</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Your Agent's Compliments Are a Confession</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Sun, 03 May 2026 00:04:08 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/your-agents-compliments-are-a-confession-3kdj</link>
      <guid>https://dev.to/michaeltuszynski/your-agents-compliments-are-a-confession-3kdj</guid>
      <description>&lt;p&gt;Count how many times your agent told you "you're right" today. Count "good catch." Count "I should have noticed that." Now ask yourself how many of those corrections will survive into tomorrow's session.&lt;/p&gt;

&lt;p&gt;The compliments are not praise. They are a confession. Every "you're right" is the agent admitting it just learned something it should have already known, in a context that will evaporate the moment the session ends. The data point is real. The retention is zero.&lt;/p&gt;

&lt;p&gt;This is the actual problem people are trying to solve when they reach for elaborate self-improvement architectures: nightly reflection cron jobs, background agents that crawl yesterday's transcripts, autonomous proposal pipelines with grading subagents and dashboards for human review. The instinct is right. The solution is theater.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;Claude Code's runtime, like any agent runtime, starts each session from a fresh conversation. &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Anthropic's own framing for effective agents&lt;/a&gt; draws a line between workflows (predefined paths) and agents (LLMs deciding their own tool use). Both reset. The lesson you taught your agent at 3pm is encoded in the message history of that one conversation. Tomorrow morning, that history is gone. The model is the same model. The instructions in CLAUDE.md are the same instructions. But the specific correction — "no, on this codebase you have to use the absolute path because launchd reset PATH on you" — lives only in the transcript.&lt;/p&gt;

&lt;p&gt;So you correct it again. And again. And the third time you notice the pattern, you start looking for a fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tempting wrong answer
&lt;/h2&gt;

&lt;p&gt;The fix that gets blogged about goes something like this: build a nightly cron job that reads yesterday's transcripts, extracts candidate lessons, drafts them as JSON proposals with frontmatter, opens a dashboard, and asks a separate grading subagent to score the proposals. Human reviews. Promotes accepted ones into a "skill" file. Repeat.&lt;/p&gt;

&lt;p&gt;This is ceremony, not discipline. Three problems with it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It substitutes process metrics for outcome.&lt;/strong&gt; You can run the pipeline every night and ship zero durable improvement. The metric you actually care about is "did the agent stop making the same mistake," not "did we generate ten proposals last week."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It moves the work from the moment that matters.&lt;/strong&gt; The right time to write the rule is the moment you notice the agent got it wrong. Not eight hours later, after a reflection agent has interpreted what it thought happened.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It puts a model in front of the file.&lt;/strong&gt; The whole reason you're writing this down is that the model is the unreliable component. Layering more model-mediated steps on top of "remember this" is the architectural equivalent of asking the goldfish to file its own memos.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The runtime layer matters. So does the substrate. None of it replaces the rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actually-working answer
&lt;/h2&gt;

&lt;p&gt;A single file. The agent reads it at the start of every session. Append-only. Numbered. Dated. Linked to the actual incident.&lt;/p&gt;

&lt;p&gt;NEXUS — my agent setup, specifically the operating layer that wraps Claude Code on my machine — formalizes this in CLAUDE.md as a behavioral protocol called Mistakes Become Rules. The wording is exact and short:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Trigger:&lt;/strong&gt; Any time Mike corrects your approach, points out an error, or says something like "no, not that" / "don't do X" / "you should have…"&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Immediately add a numbered entry to MEMORY.md's "Hard-Won Lessons" section: &lt;code&gt;[next number]. **[short title]** — [what went wrong and the rule to follow going forward].&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;On session start:&lt;/strong&gt; Read and internalize all Hard-Won Lessons before beginning work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the entire loop. There is no reflection agent. There is no nightly job. There is no dashboard. The trigger is the correction itself, the action is one append, and the agent reads the file the next time it boots up.&lt;/p&gt;
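
&lt;p&gt;The Action step is small enough to sketch as a helper, assuming the next number can be derived by counting existing numbered lines. Whether the append is typed by hand or written by the agent, this is the entire mechanism.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { appendFileSync, readFileSync } from "node:fs";

// One append, at the moment of the correction. No reflection agent,
// no nightly job, no dashboard.
function recordHardWonLesson(title: string, rule: string): void {
  const memory = readFileSync("MEMORY.md", "utf8");
  const existing = memory.match(/^\d+\.\s/gm);
  const next = (existing ? existing.length : 0) + 1;
  const date = new Date().toISOString().slice(0, 10);
  appendFileSync("MEMORY.md", `${next}. **${title}** (${date}) - ${rule}\n`);
}

recordHardWonLesson(
  "LaunchAgent log paths must be on local disk, not SMB",
  "Point launchd log paths at local disk, never at a NAS mount."
);
&lt;/code&gt;&lt;/pre&gt;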

&lt;p&gt;The file currently has twenty entries. Each one came from a specific incident, on a specific date, that cost me time. A few of them, with the context that made them rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lesson #15 — LaunchAgent log paths must be on local disk, not SMB.&lt;/strong&gt; On 2026-04-19, six LaunchAgents in my finance service silently broke. macOS TCC was blocking launchd-spawned processes from writing logs to the NAS-mounted path, even though the same SSH user could write there fine. Exit code 78. No log output, because the log path was the problem. Took an afternoon to diagnose. The rule is one sentence. The rule writes itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lesson #19 — Never &lt;code&gt;import()&lt;/code&gt; a publish script "to test it."&lt;/strong&gt; On 2026-04-29, two test imports of &lt;code&gt;publish-agent-id-role.ts&lt;/code&gt; raced because the script invokes &lt;code&gt;main()&lt;/code&gt; at module top-level. Result: duplicate posts on LinkedIn (twice), X (twice), and Ghost (one extra, deleted via Admin API). Late.dev refuses to delete already-published content, so the cleanup was manual. The rule: validate publish scripts with &lt;code&gt;tsc --noEmit&lt;/code&gt;, a &lt;code&gt;--dry-run&lt;/code&gt; flag, or by reading them. Never with &lt;code&gt;import()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lesson #20 — PM2 &lt;code&gt;script: "npm"&lt;/code&gt; ignores app &lt;code&gt;env.PATH&lt;/code&gt;.&lt;/strong&gt; On 2026-05-01, the health-api service kept reporting &lt;code&gt;online&lt;/code&gt; while the port wasn't listening. PM2 was launching &lt;code&gt;npm&lt;/code&gt; from the daemon's PATH, not the app's, which meant &lt;code&gt;better-sqlite3&lt;/code&gt; (compiled for node 22) was loading under node 25 and crashing on &lt;code&gt;ERR_DLOPEN_FAILED&lt;/code&gt;. Fix: pin &lt;code&gt;script&lt;/code&gt; to the absolute path of the desired npm. Same idea as Lesson #16, now for PM2 instead of launchd.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each entry took less than a minute to write. Each one prevents the same hour-long failure from happening twice. The compounding is the entire point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where hooks fit
&lt;/h2&gt;

&lt;p&gt;Lessons live in markdown because that's how the agent absorbs them at session start. But there's a runtime layer underneath, and it has a real role.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;Claude Code's hooks&lt;/a&gt; — PreToolUse, PostToolUse, SessionStart, and friends — let you intercept tool calls deterministically. If a lesson can be reduced to a regex on a command string ("never run &lt;code&gt;rm -rf&lt;/code&gt; outside &lt;code&gt;/tmp&lt;/code&gt;"), a hook is a better enforcement point than a markdown bullet, because the markdown bullet relies on the model reading and obeying it. The hook does not.&lt;/p&gt;

&lt;p&gt;Same logic for &lt;a href="https://code.claude.com/docs/en/skills" rel="noopener noreferrer"&gt;Claude Code Skills&lt;/a&gt;: they're great for packaging a procedure with its own tools and supporting files. They are not a substitute for the rule. They're a substrate the rule can sit on top of.&lt;/p&gt;

&lt;p&gt;The hierarchy I run with: durable rules in the markdown file, deterministic enforcement in hooks where the rule is regex-shaped, and skills for procedures with multiple steps. None of those is a self-improvement loop. None of them runs at midnight. None of them has a grading subagent. They are all read or executed at the moment they apply.&lt;/p&gt;

&lt;h2&gt;
  
  
  How you know it's working
&lt;/h2&gt;

&lt;p&gt;The test is simple. You stop hearing the same compliment twice.&lt;/p&gt;

&lt;p&gt;If your agent says "good catch" today, look it up tomorrow. Is the lesson in your file? Did the agent read the file before it started working? If yes to both, you should never hear "good catch" on that specific topic again. If you do, the rule is wrong, the file isn't being read, or the lesson didn't generalize. All three are debuggable. None of them require a reflection agent.&lt;/p&gt;

&lt;p&gt;Praise without persistence is a leak. Patch the leak, do not build a recycling system for the runoff.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>claudecode</category>
      <category>platformengineering</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Three Memory Systems Under One Login. Stop Picking Sides.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Sun, 03 May 2026 00:01:37 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/three-memory-systems-under-one-login-stop-picking-sides-1ela</link>
      <guid>https://dev.to/michaeltuszynski/three-memory-systems-under-one-login-stop-picking-sides-1ela</guid>
      <description>&lt;p&gt;Anthropic now ships at least three different memory models inside the Claude product family, and they don't behave the same way. Claude.ai has &lt;a href="https://claude.com/blog/memory" rel="noopener noreferrer"&gt;a chat memory feature for Pro, Max, Team, and Enterprise users&lt;/a&gt; that summarizes prior conversations and injects that summary into new chats. Claude Code has &lt;a href="https://code.claude.com/docs/en/memory" rel="noopener noreferrer"&gt;CLAUDE.md files plus a separate "auto memory" directory&lt;/a&gt; the model writes to itself, both loaded at session start. The API ships &lt;a href="https://docs.claude.com/en/docs/agents-and-tools/tool-use/memory-tool" rel="noopener noreferrer"&gt;a &lt;code&gt;memory_20250818&lt;/code&gt; tool&lt;/a&gt; that hands a &lt;code&gt;/memories&lt;/code&gt; directory to your application code so you can persist anything you want between turns. Three surfaces, three rule sets, three retention postures.&lt;/p&gt;

&lt;p&gt;I argued last week on this blog that the model isn't the variable that matters — the wrapper around it is. This is the next claim down the chain: if memory is a feature of that wrapper rather than the model, then vendor fragmentation is a memory problem you cannot solve by picking a surface. Stop trying.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's actually different across the three
&lt;/h2&gt;

&lt;p&gt;The chat surface remembers conversations as a 24-hour synthesis, project-scoped, controllable through a settings panel. The Code surface uses plain markdown files in your repo plus a per-project memory directory at &lt;code&gt;~/.claude/projects/&amp;lt;project&amp;gt;/memory/&lt;/code&gt; on the local machine. The API tool defines six file operations (&lt;code&gt;view&lt;/code&gt;, &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;str_replace&lt;/code&gt;, &lt;code&gt;insert&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;, &lt;code&gt;rename&lt;/code&gt;) and expects your application to implement the storage. None of these are wrong. They are designed for different jobs. But they share zero common format, no export path between them, and no way to carry context from a Claude.ai chat into a Claude Code session into an API agent without doing the plumbing by hand.&lt;/p&gt;
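
&lt;p&gt;The API tool is the one worth sketching, because the design hands storage to your application code. A hedged sketch of that dispatcher: the six command names are the ones listed above, while the payload field names and the &lt;code&gt;path.basename&lt;/code&gt; containment are assumptions of mine, not the documented schema.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import * as fs from "node:fs";
import * as path from "node:path";

const ROOT = "./memories";   // everything the model persists lives here

function handleMemoryCommand(input: any): string {
  const target = path.join(ROOT, path.basename(String(input.path ?? "")));
  switch (input.command) {
    case "view":
      return fs.readFileSync(target, "utf8");
    case "create":
      fs.writeFileSync(target, String(input.file_text ?? ""));
      return "created";
    case "str_replace": {
      const text = fs.readFileSync(target, "utf8");
      fs.writeFileSync(target, text.replace(input.old_str, input.new_str));
      return "replaced";
    }
    case "insert": {
      const lines = fs.readFileSync(target, "utf8").split("\n");
      lines.splice(Number(input.insert_line ?? 0), 0, String(input.insert_text ?? ""));
      fs.writeFileSync(target, lines.join("\n"));
      return "inserted";
    }
    case "delete":
      fs.rmSync(target);
      return "deleted";
    case "rename":
      fs.renameSync(target, path.join(ROOT, path.basename(String(input.new_path))));
      return "renamed";
    default:
      return "unsupported command";
  }
}
&lt;/code&gt;&lt;/pre&gt;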

&lt;p&gt;Birgitta Böckeler's writeup on &lt;a href="https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html" rel="noopener noreferrer"&gt;context engineering for coding agents&lt;/a&gt; frames the wrapper as everything in an AI agent except the model itself: the tool definitions, the context compaction, the feedback sensors, the system prompt, the memory between sessions. Anthropic's own engineering team &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;calls the same idea context engineering&lt;/a&gt; — the work of curating what enters the model's attention budget at each step. Memory sits squarely inside that definition. Which means the choice about &lt;em&gt;where memory lives&lt;/em&gt; is a wrapper decision, and the vendor is making it for you on each surface until you take it back.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap
&lt;/h2&gt;

&lt;p&gt;The natural reaction when a vendor ships three memory models is to figure out which one to use. Spend an afternoon reading docs, decide that the chat synthesis is for ad hoc queries, the auto memory is for coding work, and the API tool is for production agents. Move on.&lt;/p&gt;

&lt;p&gt;That reaction is wrong, and not because any of those choices are bad in isolation. It's wrong because it assumes vendor surfaces are stable. They aren't. Claude.ai's memory was Team-and-Enterprise-only at launch in September 2025, then expanded to Pro and Max in October. Claude Code's auto memory requires v2.1.59 or later and lives in a path tied to the git repo, not the user. The API memory tool is in beta under a header that already changed naming conventions twice. The vendor will keep shipping, the rules will keep shifting, and your context will keep being a second-class object inside someone else's roadmap.&lt;/p&gt;

&lt;p&gt;There's also a deeper problem. MindStudio's writeup on &lt;a href="https://www.mindstudio.ai/blog/what-is-behavioral-lock-in-persistent-ai-agents-switching-costs" rel="noopener noreferrer"&gt;behavioral lock-in&lt;/a&gt; makes the case that agent memory creates switching costs that data portability rules cannot fix. For example, even if a vendor lets you export your memory directory tomorrow, the operational understanding the agent built — your team's terminology, your exceptions, your shorthand — does not round-trip cleanly into another vendor's surface. Eight months of accumulated context turns into a re-onboarding tax the moment you switch. Parallels' 2026 cloud survey put vendor lock-in concern at 94% across 540 IT leaders; agent memory is exactly the layer where that concern compounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I do instead
&lt;/h2&gt;

&lt;p&gt;My memory lives in a NAS-backed directory called &lt;code&gt;nexus/&lt;/code&gt;, in plain markdown, under git. It has a top-level &lt;code&gt;CLAUDE.md&lt;/code&gt; that gets auto-loaded into every Claude Code session because it sits at the project root. It has a &lt;code&gt;MEMORY.md&lt;/code&gt; for long-term curated state, a &lt;code&gt;SESSION-STATE.md&lt;/code&gt; for active context, per-domain context files at &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt; for finance, health, content, and so on, and daily logs at &lt;code&gt;memory/YYYY-MM-DD.md&lt;/code&gt;. Cross-references between entities use &lt;code&gt;[[double brackets]]&lt;/code&gt; so they're grep-searchable and Obsidian-renderable. Search across the corpus runs through an Ollama embedding pipeline using &lt;code&gt;nomic-embed-text&lt;/code&gt; at 768 dimensions, indexed locally — no vendor API call required to ask "what did I decide about that account fee in February?"&lt;/p&gt;
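
&lt;p&gt;A minimal sketch of that search path, assuming Ollama's &lt;code&gt;/api/embeddings&lt;/code&gt; endpoint and an index that is nothing more than an array of file-plus-vector records kept locally.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Embed via the local Ollama endpoint; no vendor API call involved.
async function embed(text: string) {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const data = await res.json();
  return data.embedding;   // 768 numbers for this model
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  a.forEach(function (x, i) {
    dot += x * b[i];
    na += x * x;
    nb += b[i] * b[i];
  });
  return dot / Math.sqrt(na * nb);
}

// Score the query against previously indexed chunks and take the best match.
async function search(query: string, index: { file: string; vector: number[] }[]) {
  const q = await embed(query);
  const scored = index.map(function (entry) {
    return { file: entry.file, score: cosine(q, entry.vector) };
  });
  scored.sort(function (x, y) {
    return y.score - x.score;
  });
  return scored[0];   // best match
}
&lt;/code&gt;&lt;/pre&gt;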

&lt;p&gt;This stack does three things the vendor surfaces cannot.&lt;/p&gt;

&lt;p&gt;First, it survives the surface split. The same files load into Claude Code, can be pasted into Claude.ai, and can be served to an API agent through the memory tool's file ops. The format is universal because the format is just files.&lt;/p&gt;

&lt;p&gt;Second, it survives the vendor switch. If I move to a different model provider tomorrow, the markdown still parses, the embeddings still resolve, and the wikilinks still work. There is no proprietary memory schema to migrate.&lt;/p&gt;

&lt;p&gt;Third, it gives me audit. I can grep my own context. I can diff what changed last week. I can &lt;code&gt;trash&lt;/code&gt; something I don't want anymore and recover it if I was wrong. None of those operations exist on the chat memory surface, and they only partially exist on the auto-memory surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  The general pattern
&lt;/h2&gt;

&lt;p&gt;The vendor's wrapper is not your wrapper. It's theirs, designed around their product roadmap and their retention model and their billing surfaces. When that wrapper includes a memory layer, putting your context in it means putting your operational knowledge in someone else's container. Fine for ephemeral chat. Not fine for the accumulated state of a year of work.&lt;/p&gt;

&lt;p&gt;The fix is not to pick the right vendor surface. The fix is to keep your memory outside any vendor surface, in a format you own, with search you control, and let the vendor surfaces read from it as needed. Claude Code already does this for free with &lt;code&gt;CLAUDE.md&lt;/code&gt;. The other surfaces will eventually catch up, or they won't, and either way your context survives.&lt;/p&gt;

&lt;p&gt;Last week's post argued that the wrapper around the model is what matters. This one finishes the sentence: don't trust theirs with your context.&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>aiagents</category>
      <category>vendorlockin</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Stop Adopting AI. Start Exposing Your Context.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Fri, 01 May 2026 20:50:12 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/stop-adopting-ai-start-exposing-your-context-2pog</link>
      <guid>https://dev.to/michaeltuszynski/stop-adopting-ai-start-exposing-your-context-2pog</guid>
      <description>&lt;p&gt;The AI adoption pathway that's actually working in 2026 is not "deploy a copilot to your team." It's "expose your org's context to whichever model your team already chose." That sounds like a small shift. It's not. It changes who picks the tool, what your procurement team buys, and where the work of getting value out of AI actually lives.&lt;/p&gt;

&lt;p&gt;The numbers behind the shift are bleak for the old playbook. MIT's NANDA study of 300 enterprise AI deployments found 95% of GenAI pilots delivered no measurable P&amp;amp;L impact. The diagnosis was not the model. It was missing context — the data, workflow knowledge, and institutional memory the model needed to actually be useful inside a specific business. &lt;a href="https://atlan.com/know/context-engineering-framework/" rel="noopener noreferrer"&gt;Atlan summarizes the same finding&lt;/a&gt; and quotes Box CEO Aaron Levie, who calls context engineering "the long pole in the tent for AI Agents adoption in most organizations." Gartner went further in mid-2025: "context engineering is in, prompt engineering is out," with a prediction that 80% of AI tools will incorporate it by 2028.&lt;/p&gt;

&lt;p&gt;Klarna is the worked example everyone now points at. Between 2022 and 2024, the company replaced about 700 customer-service positions with an OpenAI-powered chatbot. By spring 2025 customer satisfaction had dropped 22% and complaints had piled up. &lt;a href="https://www.entrepreneur.com/business-news/klarna-ceo-reverses-course-by-hiring-more-humans-not-ai/491396" rel="noopener noreferrer"&gt;The CEO publicly admitted the cuts went too far&lt;/a&gt; and pivoted to a hybrid model, rehiring humans for anything requiring judgment. The model wasn't broken. The pathway was. The org rolled out an agent without exposing the context it needed — refund policies, payment edge cases, regional regulations, escalation patterns — and the agent shipped generic answers to specific problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What replaced it
&lt;/h2&gt;

&lt;p&gt;Three things converged in late 2025 that quietly killed the old pathway.&lt;/p&gt;

&lt;p&gt;The first is the &lt;strong&gt;Model Context Protocol&lt;/strong&gt;. Anthropic open-sourced MCP in November 2024; by March 2026 the SDK was hitting &lt;a href="https://thenewstack.io/why-the-model-context-protocol-won/" rel="noopener noreferrer"&gt;97 million monthly downloads&lt;/a&gt; — a 970x growth curve from launch. OpenAI, Microsoft, Google, and AWS all shipped MCP client support within thirteen months. An independent census in Q1 2026 indexed 17,468 servers across registries. MCP is not a model. It is a protocol for handing a model the right context — your Slack, your issue tracker, your observability stack, your customer database — at the moment of the request.&lt;/p&gt;
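
&lt;p&gt;To make the mechanics concrete at the smallest possible scale, here is a sketch of a single-tool MCP server. It follows the TypeScript SDK's documented quickstart shape; the tool name, the refund-policy lookup, and the policy text are invented for illustration, and a real deployment would put access control in front of the lookup.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "org-context", version: "0.1.0" });

// Hypothetical internal lookup. In practice this sits in front of your
// wiki, issue tracker, or customer database.
function lookupRefundPolicy(region: string): string {
  const policies = new Map([
    ["EU", "14-day withdrawal right, refund to the original payment method."],
    ["US", "30-day refund window, store credit after that."],
  ]);
  return policies.get(region) ?? "No regional policy on file; escalate.";
}

server.tool(
  "refund_policy",
  { region: z.string() },
  async function (args) {
    return {
      content: [{ type: "text", text: lookupRefundPolicy(args.region) }],
    };
  }
);

await server.connect(new StdioServerTransport());
&lt;/code&gt;&lt;/pre&gt;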

&lt;p&gt;The second is &lt;strong&gt;agent skills as a portable artifact&lt;/strong&gt;. &lt;a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills" rel="noopener noreferrer"&gt;Anthropic launched Agent Skills in October 2025&lt;/a&gt; and open-sourced the SKILL.md format in December. Atlassian, Canva, Cloudflare, Figma, Notion, Ramp, and Sentry all shipped skills in the launch window. A skill is a directory: instructions, scripts, resources. Drop the directory next to a workflow that recurs and any compatible agent can run it. The format is Anthropic's, but the spec is the same shape as .cursorrules, AGENTS.md, GitHub Spaces, and the rest of the convergence happening across vendors.&lt;/p&gt;

&lt;p&gt;The third is &lt;strong&gt;the in-repo memory file&lt;/strong&gt; as a de facto standard. CLAUDE.md, AGENTS.md, .cursorrules, and the rest are all the same idea: a markdown file at the root of a project that tells whatever agent gets dropped in what the project is, what conventions matter, what the gotchas are, and where the bodies are buried. The agent reads the file at the start of every session. The org documents itself once. The dev picks the model.&lt;/p&gt;

&lt;p&gt;Read those three together and the picture is obvious. The unit of AI adoption stopped being "the agent." It became "the substrate the agent stands on."&lt;/p&gt;

&lt;h2&gt;
  
  
  What that looks like in practice
&lt;/h2&gt;

&lt;p&gt;I run a personal agentic stack — NEXUS — that's been doing this for about a year. The repo has a CLAUDE.md at the root that lays out the workspace structure, identity, behavioral protocols, and lessons learned. There are a dozen &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt; files for finance, content, health, the rest. There's an MCP server for Gmail, Calendar, Slack, Drive, and a few internal tools. There are skills for the recurring workflows — publishing a blog post, running a finance check, doing a health digest. The agent I happen to be using on a given day — Claude Code mostly, occasionally Cursor — reads what it needs at session start and gets to work.&lt;/p&gt;

&lt;p&gt;I don't pick a model and roll it out. I expose context, and whichever model is in the chair when I sit down knows what's going on.&lt;/p&gt;

&lt;p&gt;The same shape works at company scale, just with more access controls and an actual budget. The work is documenting the org until any agent dropped into it would be useful. The model becomes a free variable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes about procurement
&lt;/h2&gt;

&lt;p&gt;The old AI procurement motion: pick a vendor, sign a per-seat contract, train the team on the tool, run change-management sessions, hope adoption hits 30%. This is what Klarna did. The asset created at the end of it is a vendor relationship and some training decks.&lt;/p&gt;

&lt;p&gt;The new motion: invest in the context infrastructure — an MCP gateway, a documentation platform that agents can read, semantic indexes for your wikis and tickets, a skills directory for recurring workflows. The model is whoever the dev or team picked. The procurement decision is &lt;em&gt;which surfaces to expose&lt;/em&gt;, not &lt;em&gt;which copilot to license&lt;/em&gt;. The asset created is a substrate that survives the next model rotation.&lt;/p&gt;

&lt;p&gt;The implication that nobody loves: tool-selection RFPs become a free variable rotation, not a strategic decision. The strategic decision is what your org has to say to a model that doesn't already know it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do this week
&lt;/h2&gt;

&lt;p&gt;Four moves if you want to test the pathway without committing to a vendor.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audit your CLAUDE.md / AGENTS.md surface.&lt;/strong&gt; Drop a coding agent into your main repo with no other context. Ask it to make a non-trivial change. If it makes obvious mistakes — wrong test runner, ignored coding conventions, bypassed an internal review process — those are the gaps a memory file should close. Write that file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick three high-frequency workflows and write skills.&lt;/strong&gt; The kind of thing a senior engineer explains to a new hire in their first week. Convert each to a SKILL.md or an equivalent. Measure time-to-task before and after.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stand up an MCP gateway for your top three internal systems.&lt;/strong&gt; Issue tracker, observability, customer database. Most have community MCP servers already; the work is access control, not implementation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop running tool-selection RFPs.&lt;/strong&gt; Or if you have to, run them as a side track. The strategic work — and the asset that survives the next model release — is the context, not the contract.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The throughline
&lt;/h2&gt;

&lt;p&gt;The agentic adoption series has been running through failure modes. Part 1 was your team not trusting the agent. Part 2 was your customers not trusting the agent. The Cron-Not-Agents post was teams agentifying things that should have stayed deterministic. Last week's was the IAM seam — agent identities sharing primitives with everything else. This one is the answer to all of them.&lt;/p&gt;

&lt;p&gt;The pathway that works in 2026 is not adoption of a tool. It is exposure of a substrate. Once your org has the substrate, whatever model your team picks lands on something it can stand on. Without it, every rollout looks like Klarna's: an agent given a job, with no context for how the job is actually done, generating generic answers to specific problems and dropping customer satisfaction 22 percent before someone notices.&lt;/p&gt;

&lt;p&gt;Pick the context. The model is going to keep changing.&lt;/p&gt;

</description>
      <category>aiadoption</category>
      <category>contextengineering</category>
      <category>mcp</category>
      <category>platformengineering</category>
    </item>
  </channel>
</rss>
