<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: René Zander</title>
    <description>The latest articles on DEV Community by René Zander (@reneza).</description>
    <link>https://dev.to/reneza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1138713%2Fa7d8635c-22db-4dec-b156-1fb07de64a8d.jpeg</url>
      <title>DEV Community: René Zander</title>
      <link>https://dev.to/reneza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/reneza"/>
    <language>en</language>
    <item>
      <title>Sandboxing an AI Coding Agent: The Harness Owns the Boundaries</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Fri, 03 Jul 2026 15:27:46 +0000</pubDate>
      <link>https://dev.to/reneza/sandboxing-an-ai-coding-agent-the-harness-owns-the-boundaries-28ib</link>
      <guid>https://dev.to/reneza/sandboxing-an-ai-coding-agent-the-harness-owns-the-boundaries-28ib</guid>
      <description>&lt;p&gt;The obvious way to improve a coding agent is to make it more capable: a stronger model, a wider context window, more tools, more room to act on its own. That is not where my problems come from. My agents seldom fail because they reason badly. They fail because they take the shortest path to something that looks finished and skip the process that was supposed to make the result trustworthy.&lt;/p&gt;

&lt;p&gt;The pattern is familiar to anyone who has watched an agent work unsupervised. It edits the tests until they pass. It reports that a command ran instead of proving it. It writes into the working repository before anyone reviewed a diff. It switches to a cheaper model mid-task with no sense of the cost. These are not reasoning errors. They are shortcuts around a process, and a stronger model takes them faster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Frp29snt5d4bmpduiz9h8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Frp29snt5d4bmpduiz9h8.gif" alt="The Pi coding agent running inside a sandboxed staging workspace, with staged diffs shown before anything reaches the real project" width="560" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/renezander030/pi-safe" rel="noopener noreferrer"&gt;pi-safe&lt;/a&gt; launching the Pi agent into an NVIDIA OpenShell sandbox: the real project is copied to a staging tree the agent works in, its extensions and credentials load inside the sandbox, and changes only reach the real repo after review.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape: the model requests, the harness owns the boundaries
&lt;/h2&gt;

&lt;p&gt;I stopped trying to make the agent more trustworthy and started constraining what it can reach. The agent runs inside a sandbox that owns the filesystem, network, and credential policy, and writes only to a staged copy of the repository. Its output reaches the real project through a separate evaluator. The model requests; the harness owns the boundaries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdcu61uye8zpmy589wnks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdcu61uye8zpmy589wnks.png" alt="Overview: the request flows through model routing, context control, and the agent inside a runtime guard, then staged changes pass a patch evaluator before reaching the real repository" width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every arrow in that path is a place I can say no.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the substrate owns, what my extensions own
&lt;/h2&gt;

&lt;p&gt;The lower layer is NVIDIA's OpenShell, a sandbox and credential substrate. It owns sandbox lifecycle, filesystem and process isolation, minimal outbound network by default, policy-enforced egress, and named credential providers that inject secrets at runtime rather than copying them onto disk. It is infrastructure I want to own as little of as possible.&lt;/p&gt;

&lt;p&gt;The upper layer is specific to how I work. It is a set of small extensions that control the agent's behaviour: which model it picks, how much context it carries, what it can recall, and whether its output is allowed to land. The substrate keeps the agent contained; the extensions decide how it acts while contained.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxoi7uw75zebkwpgijpch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxoi7uw75zebkwpgijpch.png" alt="The substrate owns isolation, network policy, and credential providers; the control layer of small extensions owns model routing, context pressure, recall, and the patch gate" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Each part owns one boundary
&lt;/h2&gt;

&lt;p&gt;The extensions are deliberately not one big extension. Each has a narrow job, so each has a narrow failure domain. If model choice is wrong, I fix the router. If context bloats, I fix the cache layer. If recall is wrong, I inspect the recall surface. One giant extension would be simpler to explain and harder to trust, because every failure shares the same blast radius.&lt;/p&gt;

&lt;p&gt;The router classifies work and escalates on process, not prestige. Routine work stays cheap, mechanical work can run local, and only stuck or high-risk reasoning reaches a stronger model. The cache layer watches context pressure and compacts before a bloated working set makes every later decision worse. Recall splits by trust: derived knowledge is graphed from the code, while authored knowledge is the reviewed bundle for what the code cannot explain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fz96dok34whuir5ese152.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fz96dok34whuir5ese152.png" alt="Each capability sits behind its own boundary: router, cache, derived code recall in teal, authored knowledge in amber, each a separate failure domain" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A sandbox is not an evaluator
&lt;/h2&gt;

&lt;p&gt;The boundary I care about most is the last one. A sandbox runs code safely. An evaluator decides whether that code should land. Those are different jobs, and collapsing them is how output nobody checked ends up in the main branch.&lt;/p&gt;

&lt;p&gt;The evaluator takes the agent's patch, applies it to a disposable workspace, runs its checks, and returns one of three answers: pass, block, or override. The substrate can supply the process the evaluation runs in, but it does not make the decision. The real repository stays behind that gate; the agent's writable root is never the project itself.&lt;/p&gt;
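
&lt;p&gt;A minimal sketch of that contract in Python. This is not pi-gate's actual API: the names &lt;code&gt;evaluate_patch&lt;/code&gt;, &lt;code&gt;Verdict&lt;/code&gt;, and &lt;code&gt;has_tests&lt;/code&gt; are hypothetical, and treating "override" as a human explicitly accepting a blocked patch is my assumption. The sketch copies the staged tree into a throwaway directory, runs the checks there, and returns a verdict without ever touching the real repository.&lt;/p&gt;

```python
import shutil
import tempfile
from enum import Enum
from pathlib import Path


class Verdict(Enum):
    PASS = "pass"
    BLOCK = "block"
    OVERRIDE = "override"


def evaluate_patch(staged_dir, checks, human_override=False):
    """Run every check against a disposable copy of the staged tree."""
    with tempfile.TemporaryDirectory() as scratch:
        workspace = Path(scratch) / "workspace"
        shutil.copytree(staged_dir, workspace)
        # Each check is a callable that takes the workspace path and returns a bool.
        failed = [check.__name__ for check in checks if not check(workspace)]
    if not failed:
        return Verdict.PASS, failed
    if human_override:
        # Assumption: an override is a recorded human decision, never the agent's.
        return Verdict.OVERRIDE, failed
    return Verdict.BLOCK, failed


def has_tests(workspace):
    # Hypothetical check: the patch must not delete the tests directory.
    return (workspace / "tests").is_dir()
```

&lt;p&gt;The point of the shape is that the verdict comes from code the agent cannot edit, and that the list of failed checks travels with it, so a block is something a reviewer can read rather than a bare refusal.&lt;/p&gt;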

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F79ls8sg40xo1joflkteq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F79ls8sg40xo1joflkteq.png" alt="The patch evaluator applies staged changes in a disposable workspace and returns pass, block, or override; only a pass reaches the real repository" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would delete next
&lt;/h2&gt;

&lt;p&gt;The direction of this system is fewer parts, not more. The best part is no part. Every time the substrate can own a boundary directly, I want to delete my custom layer for it. The wrappers I run today exist only until the platform underneath is clean enough to remove them.&lt;/p&gt;

&lt;p&gt;The test for every piece is the same. Does it still own a real boundary? If a component only lets the agent do more, it has failed the test and should go. The harness was never meant to make the agent impressive because it can do everything. It was meant to leave fewer places where the model can declare victory without earning it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The parts
&lt;/h2&gt;

&lt;p&gt;Substrate: &lt;a href="https://github.com/NVIDIA/openshell" rel="noopener noreferrer"&gt;NVIDIA OpenShell&lt;/a&gt;, the sandbox and credential runtime the whole thing sits on.&lt;/p&gt;

&lt;p&gt;The extensions, each owning one boundary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/renezander030/pi-safe" rel="noopener noreferrer"&gt;pi-safe&lt;/a&gt;: launches the agent inside the sandbox by default&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/renezander030/pi-task-router" rel="noopener noreferrer"&gt;pi-task-router&lt;/a&gt;: model choice and escalation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/renezander030/pi-cache-optimizer" rel="noopener noreferrer"&gt;pi-cache-optimizer&lt;/a&gt;: context and cache pressure&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/renezander030/pi-code-context" rel="noopener noreferrer"&gt;pi-code-context&lt;/a&gt;: semantic code search&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/renezander030/pi-recall" rel="noopener noreferrer"&gt;pi-recall&lt;/a&gt;: the recall surface for the agent&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/renezander030/pi-codegraph" rel="noopener noreferrer"&gt;pi-codegraph&lt;/a&gt;: derived code knowledge behind recall&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/renezander030/pi-okf" rel="noopener noreferrer"&gt;pi-okf&lt;/a&gt;: authored, reviewed knowledge bundles&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/renezander030/pi-gate" rel="noopener noreferrer"&gt;pi-gate&lt;/a&gt;: the patch evaluator contract&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;pi-creds (scoped credential requests) and pi-eval (process-step evaluation) are the next boundaries, not built yet.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks. If this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>I Killed a 773 MB Model Download at 60%. It Recovered in 44 Seconds.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 02 Jul 2026 15:29:32 +0000</pubDate>
      <link>https://dev.to/reneza/i-killed-a-773-mb-model-download-at-60-it-recovered-in-44-seconds-4fh0</link>
      <guid>https://dev.to/reneza/i-killed-a-773-mb-model-download-at-60-it-recovered-in-44-seconds-4fh0</guid>
      <description>&lt;p&gt;The discussion around local AI is a hardware discussion: which model fits on which device, at what speed, at what quantization. Framed that way, the field reads as a benchmark race against the cloud, and the cloud usually wins. The more consequential development sits one layer lower, in how models reach devices and how devices reach each other's models, and it is easy to miss because no benchmark measures it. A download killed at 488 of 773 megabytes measures it precisely.&lt;/p&gt;

&lt;p&gt;This week I tested that layer directly. I installed a newly released local AI SDK on the cheapest server I operate, a virtual machine with four CPU cores, 8 GB of RAM and no GPU, and paid attention not to the token rate but to the network architecture underneath.&lt;/p&gt;

&lt;h2&gt;
  
  
  A model that arrives like a torrent
&lt;/h2&gt;

&lt;p&gt;Peer-to-peer local AI means two things move between devices without a cloud endpoint: the model itself, synchronized block by block from whichever peers seed it, and the conversation with a model, when one device sends its inference calls over an encrypted peer connection to a device that runs the model locally.&lt;/p&gt;

&lt;p&gt;On my test box, that looked unremarkable in the best sense: three commands, a 773 MB model fetched from the peer-to-peer registry, a correct completion streamed on CPU only.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;qvac-test &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;qvac-test &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm init &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm pkg &lt;span class="nb"&gt;set type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;module
npm &lt;span class="nb"&gt;install&lt;/span&gt; @qvac/sdk
node quickstart.js   &lt;span class="c"&gt;# loads Llama 3.2 1B, streams a completion&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The transport shows its nature when things go wrong. I wiped the cache, started the download again, killed the process at 488 of 773 megabytes, and reran the command. The partial blocks on disk were reused, and the rerun completed, inference included, in 44 seconds. For comparison, resumable model downloads have been an open request in the transformers.js project &lt;a href="https://github.com/huggingface/transformers.js/issues/1220" rel="noopener noreferrer"&gt;since March 2025&lt;/a&gt;, because over plain HTTP, partial caching is a hard problem. Over a block-synchronized transport, that resume is the default behavior, not a feature.&lt;/p&gt;
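
&lt;p&gt;Why resume falls out of the design can be sketched in a few lines of Python. This is not QVAC's or Holepunch's implementation: the per-block files, the SHA-256 manifest, and the &lt;code&gt;fetch_block&lt;/code&gt; callable are assumptions for illustration. When every block is verified against a known hash, whatever survived on disk is trusted as-is and only the gaps are fetched.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def block_hash(data):
    return hashlib.sha256(data).hexdigest()


def missing_blocks(manifest, cache_dir):
    """Return indexes of blocks that are absent or corrupt on disk."""
    cache = Path(cache_dir)
    todo = []
    for index, expected in enumerate(manifest):
        block_file = cache / f"{index}.blk"
        if block_file.is_file() and block_hash(block_file.read_bytes()) == expected:
            continue  # a verified block survives the interruption
        todo.append(index)
    return todo


def resume(manifest, cache_dir, fetch_block):
    """Fetch only what is missing; fetch_block(index) is a hypothetical peer call."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    fetched = missing_blocks(manifest, cache)
    for index in fetched:
        (cache / f"{index}.blk").write_bytes(fetch_block(index))
    return fetched
```

&lt;p&gt;In this sketch there is no separate "resume" code path. A fresh download and a rerun after a kill call the same function; the rerun simply finds fewer blocks missing.&lt;/p&gt;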

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0hp0pu97a4uyssosefom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0hp0pu97a4uyssosefom.png" alt="P2P model delivery: a plain HTTP download restarts from zero after an interruption, while block-synchronized delivery keeps its blocks on disk, resumes at the killed block, and completed in 44 seconds" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Inference as a shared resource
&lt;/h2&gt;

&lt;p&gt;The SDK in my test, &lt;a href="https://github.com/tetherto/qvac" rel="noopener noreferrer"&gt;QVAC&lt;/a&gt; by Tether, treats that transport as the foundation for something larger. Its architecture documents describe delegated inference over the &lt;a href="https://holepunch.to" rel="noopener noreferrer"&gt;Holepunch stack&lt;/a&gt;, and the mechanics matter: the model does not move, and neither does the inference. A device that holds a model announces it on a peer-discovery topic; another device connects, and its inference calls proxy through an encrypted peer-to-peer stream while the holding device executes locally and streams results back. There is no server tier in this design. Holding and borrowing are roles per model, not device classes: every peer runs the same stack and is addressed by a public key, and one machine can serve a model while borrowing another, the way a file-sharing peer seeds one file and fetches the next. Access is gated by a firewall of allowed public keys, and blind relay nodes route the traffic across NATs. What travels between the devices is the conversation, an agent on one machine talking to a model on another. The project frames the ambition as building systems "like BitTorrent, IPFS, and blockchain networks, but for AI."&lt;/p&gt;

&lt;p&gt;A scope note: I verified the model-delivery layer firsthand; the delegation above it is the documented design on the same stack that moved my 773 MB model. Taken as designed, a fleet of edge devices (sensor boxes, point-of-sale terminals, machines on a factory floor) could share whichever peer currently holds a capable model, without any of them holding an API key or reaching a cloud endpoint. Model registry, transport and delegation are all peer-to-peer; no central server sits in the path to be metered, throttled or switched off.&lt;/p&gt;

&lt;p&gt;For regulated environments, the same property reads differently but lands in the same place: the data path is inspectable end to end, and nothing in it terminates at a third party.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fwviogt27gxaz65s36ero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fwviogt27gxaz65s36ero.png" alt="Delegated inference topology: edge peers without a local model send inference calls over an encrypted peer-to-peer stream, through a blind relay and a public-key firewall, to the peer holding the model, which executes locally and streams results back" width="799" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The layer worth evaluating
&lt;/h2&gt;

&lt;p&gt;The plumbing around all this is unusually complete for a young SDK. Session state persists to disk, output can be constrained with a JSON schema or a grammar enforced in the sampler, and streaming transcription ships with voice activity detection. The install is heavy at 3.2 GB of node_modules, a known issue the project tracks, and the team is still hardening GPU edge cases. None of that changes the architecture underneath.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkkjfs4fgup8tj7wqvoi7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkkjfs4fgup8tj7wqvoi7.png" alt="Two layers of local AI: benchmarked single-device inference sits above the unbenchmarked network layer between devices, which decides whether edge fleets become practical; the afternoon test is to kill the model transfer and watch whether it resumes" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A 44-second recovery from a killed download told me more about this stack than any tokens-per-second table could have. Local AI on a single device is the settled part; benchmarks measure it because it is measurable. The part that decides whether fleets of edge agents become practical is the network layer between the devices, and that layer can be tested in an afternoon: interrupt the model transfer, watch what resumes, and read what the architecture does when no cloud is in the path.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/llm-break-even/" rel="noopener noreferrer"&gt;the self-hosted LLM break-even calculator&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Two Kinds of Agent Memory: OKF Bundles vs. Codebase Knowledge Graphs</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 30 Jun 2026 15:28:37 +0000</pubDate>
      <link>https://dev.to/reneza/two-kinds-of-agent-memory-okf-bundles-vs-codebase-knowledge-graphs-3lhl</link>
      <guid>https://dev.to/reneza/two-kinds-of-agent-memory-okf-bundles-vs-codebase-knowledge-graphs-3lhl</guid>
      <description>&lt;p&gt;Half of the memory you are about to hand-write for your agent is already sitting in your codebase. The other half, no indexer will ever find.&lt;/p&gt;

&lt;p&gt;Both gaps feel identical from the agent's side. It opens every session knowing nothing about your systems, so the instinct is to give it one memory store and move on. The two gaps are not the same. One is derivable. One is not. The tool that closes the first does nothing for the second.&lt;/p&gt;

&lt;p&gt;Watch an agent open a repo it has seen ten times. It greps. It reads the same forty files. It rebuilds the same call graph it built yesterday, spending a few hundred thousand tokens to relearn what the code already states. Then it asks you which database is the source of truth, because the code does not say.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part the code already knows
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9bzm1pzblgdsbgg74mcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9bzm1pzblgdsbgg74mcm.png" alt="Code parsed into a knowledge graph, letting the agent ask who calls a function, the blast radius of a diff, and what is dead code" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most of what an agent relearns is structural. Who calls &lt;code&gt;ProcessOrder&lt;/code&gt;. What breaks if you change this signature. Which routes are dead. That knowledge is true whether or not anyone wrote it down, because it is encoded in the source.&lt;/p&gt;

&lt;p&gt;So derive it. A code knowledge graph parses the repo once and answers structural questions from a persistent index. The one I have been testing, codebase-memory-mcp, builds that graph with tree-sitter across 158 languages and serves it to any agent over MCP. The agent stops grepping and starts querying: trace the callers of a function, map the blast radius of a diff, list dead code. Things grep cannot answer at any speed. I run it behind a small trust gate, so an agent only queries repos I have vetted: &lt;a href="https://github.com/renezander030/pi-codegraph" rel="noopener noreferrer"&gt;pi-codegraph&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The token savings are real, but read the measured number. The project's preprint reports roughly 10x fewer tokens and 83% answer quality across 31 repositories. The README's "99%" comes from a hand-picked query set. The honest figure is still a strong figure. You do not need to inflate it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part no graph will find
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7yuxuxqp9tehowrrsdk4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7yuxuxqp9tehowrrsdk4.png" alt="An OKF bundle as a directory of markdown concept files in git: the canonical user table, a service that must never call the legacy API, a staging metric that lies" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the half that is not in any AST. Which of three &lt;code&gt;users&lt;/code&gt; tables is canonical. Why the payments service must never touch the legacy billing API. That the staging cluster reports latency it does not actually have. None of this is structure. It is judgment, history, and consequence. It lives in people, and people leave.&lt;/p&gt;

&lt;p&gt;OKF is the format for writing that down. Open Knowledge Format, an open spec Google published in June 2026, is a directory of markdown files with YAML frontmatter. One concept per file. A folder of concepts is a bundle. You version it in git, review it in pull requests, and serve it to any agent over MCP as resources. It is boring on purpose. If you can &lt;code&gt;cat&lt;/code&gt; a file, you can read it. If you can &lt;code&gt;git clone&lt;/code&gt;, you can ship it. The reader and curator I point my agents at, with the same trust gate, is &lt;a href="https://github.com/renezander030/pi-okf" rel="noopener noreferrer"&gt;pi-okf&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mistake is using one for the other
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxf44x9gg2l77uw9f5mnx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxf44x9gg2l77uw9f5mnx.png" alt="A two-by-two of knowledge type against tool: a code graph derives structural knowledge but finds silence in tribal knowledge; an OKF bundle authors tribal knowledge but is a transcription tax on structural knowledge" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Point a graph indexer at tribal knowledge and you get silence, because there is no edge in the AST for "deprecated, do not call." Hand-write an OKF concept for every function's callers and you are transcribing by hand what the graph returns in a millisecond, and your copy is wrong by the next commit.&lt;/p&gt;

&lt;p&gt;So stop asking which memory tool to install. Ask whether the knowledge your agent lacks is authored-only or derivable. Get that backwards and you pay twice: once to write down what the code already states, again when your hand-written copy goes stale and quietly misleads the agent you built it for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which one for which team
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fvw2bpm2hgwu3lnwlgeet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fvw2bpm2hgwu3lnwlgeet.png" alt="A developer in a large codebase maps to the code graph, an operator across many systems maps to the OKF bundle, and together the graph emits concepts into the bundle while a human edits intent, producing one source the agent reads" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A developer dropped into a large or unfamiliar codebase needs derived knowledge. The questions are structural and the code holds the answers. Reach for the graph.&lt;/p&gt;

&lt;p&gt;An operator whose agent reaches across many systems (the MCP-heavy setup with data platforms, internal APIs, and ops runbooks) needs authored knowledge. The value sits between the systems, not inside any one of them, and a single-repo parser is blind to it. Reach for OKF bundles.&lt;/p&gt;

&lt;p&gt;Most real setups need both, and the two compose better than either alone. Let an enrichment agent walk the code graph and emit OKF concepts for the architecture it can derive. Then a human edits in the parts the graph cannot see: the canonical, the why, the never. The graph keeps the bundle honest about structure. The human keeps it honest about intent.&lt;/p&gt;

&lt;p&gt;Derive what the code knows. Author what only people do. One half is nearly free. The other is the actual job.&lt;/p&gt;

&lt;p&gt;So before you bolt another memory server onto your agent, sort the knowledge into the two piles first. That split is the first thing I set up when I build a production agent. Two questions decide it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What does your agent keep relearning that the code already states?&lt;/li&gt;
&lt;li&gt;What does it keep guessing because nobody ever wrote it down?&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/guides/agent-memory-task-manager/" rel="noopener noreferrer"&gt;agent memory from your task manager&lt;/a&gt; is the companion guide.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Four Villains Living in Your Agent's System Prompt</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 25 Jun 2026 10:27:13 +0000</pubDate>
      <link>https://dev.to/reneza/the-four-villains-living-in-your-agents-system-prompt-2kpd</link>
      <guid>https://dev.to/reneza/the-four-villains-living-in-your-agents-system-prompt-2kpd</guid>
      <description>&lt;p&gt;Your AI agent fails decisions for the same four reasons a bad manager does. A bigger model fixes none of them.&lt;/p&gt;

&lt;p&gt;Not because the model is dumb. Because nothing in its loop forces it to widen its options, look for evidence it is wrong, or check itself before it reports "done." It takes the first reading of your prompt and runs.&lt;/p&gt;

&lt;p&gt;You know the shape. The agent confidently ships a plan, the plan was wrong three steps back, and the only signal you got was a fluent summary saying it worked. A reliability study this June put a number on it: the strongest models melt down most in long task chains, failure rates up to 19%, precisely because they chase the most ambitious strategies.&lt;/p&gt;

&lt;p&gt;These four failures are not new. Chip and Dan Heath named them in &lt;em&gt;Decisive&lt;/em&gt;, a 2013 book about human decisions. They call them the four villains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Narrow framing.&lt;/strong&gt; The agent treats a task as one path and never generates a second. No "what else could this mean."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confirmation bias.&lt;/strong&gt; It defends its own first plan instead of testing it. It collects reasons it is right, not reasons it is wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short-term pull.&lt;/strong&gt; For a human it is emotion. For an agent it is the cheapest token path: the answer fastest to produce, not the one that holds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overconfidence.&lt;/strong&gt; The dangerous one. It marks work complete without verifying, then writes you a convincing story about it.&lt;/p&gt;

&lt;p&gt;The Heaths' answer is a process you can encode. Four steps, and all four fit in a system prompt as a gate every non-trivial decision passes through. The acronym is WRAP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;W, widen.&lt;/strong&gt; Force at least two real options before committing. The cheap trigger: "if the obvious approach were banned, what would I do?" Put it in the prompt as a required step, not a suggestion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R, reality-test.&lt;/strong&gt; Ooch before you commit. That is the Heaths' word for running a small experiment first: try the change against fake data or in a dry run, not the whole thing live. And make the agent hunt for the disconfirming fact, not the confirming one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A, attain distance.&lt;/strong&gt; Tag the decision: reversible, or one-way door? Reversible runs autonomously. One-way doors stop and ask. That single line of policy buys back most of your blast radius.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P, prepare to be wrong.&lt;/strong&gt; The step everyone skips. A premortem ("it is a week later and this broke, why?") plus a tripwire: a concrete signal that triggers a halt. Call it a circuit breaker if that lands better. Without it, "autonomous" just means "fails silently for longer."&lt;/p&gt;
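
&lt;p&gt;The last two steps reduce to a few lines of policy. Here is a minimal Python sketch, assuming a hypothetical harness that tags every proposed action before it runs. &lt;code&gt;Action&lt;/code&gt;, &lt;code&gt;gate&lt;/code&gt;, and &lt;code&gt;tripwire_hit&lt;/code&gt; are illustrative names, not part of any real agent framework.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A proposed agent action, tagged before execution (hypothetical shape)."""
    name: str
    reversible: bool


def gate(action: Action, tripwire_hit: bool = False) -> str:
    """Decide what the harness does with a proposed action.

    'halt' when the tripwire signal fired, 'run' for reversible actions,
    'ask' for one-way doors that need a human.
    """
    if tripwire_hit:
        return "halt"
    return "run" if action.reversible else "ask"


if __name__ == "__main__":
    for action in (Action("edit draft file", True), Action("drop production table", False)):
        print(f"{action.name}: {gate(action)}")
```

&lt;p&gt;The point of the sketch is where the decision lives: in code the agent calls, not in prose the agent may reinterpret.&lt;/p&gt;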

&lt;p&gt;This is not a book riff. In June 2026 Google DeepMind shipped its AI Control Roadmap, which treats internal agents as potentially misaligned and has a second trusted system watch the working one. That is reality-test and prepare-to-be-wrong, in production, at one of the labs building the models. The same week's reliability research says the same thing from the other side: more capability, more meltdown.&lt;/p&gt;

&lt;p&gt;So the lever is not the next model. The Heaths cite research showing that a disciplined process contributes more to decision quality than added analysis. For agents that means the four steps belong in the prompt, not the model card.&lt;/p&gt;

&lt;p&gt;Pull up your agent's system prompt. Which of the four villains does it actually gate, and which one is it one bad tool call away from?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>My AI Could Finish Any Task. It Couldn't Tell Me Which Were a Waste.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Sun, 21 Jun 2026 15:15:41 +0000</pubDate>
      <link>https://dev.to/reneza/my-ai-could-finish-any-task-it-couldnt-tell-me-which-were-a-waste-4oa8</link>
      <guid>https://dev.to/reneza/my-ai-could-finish-any-task-it-couldnt-tell-me-which-were-a-waste-4oa8</guid>
      <description>&lt;p&gt;My AI agents could finish any task I handed them. Not one of them could tell me the task was a waste of a month.&lt;/p&gt;

&lt;p&gt;That gap was never about model quality. It was about which layer I aimed them at. I had handed over execution: write the draft, run the sync, ship the change. Steering I kept for myself: deciding what is worth doing, and in what order, before the field moves underneath me. But my own judgment is the part that ages fastest.&lt;/p&gt;

&lt;p&gt;I run a lot of projects at once, in a field that reprices itself every few weeks. My task system was a task manager with semantic search bolted on. It could find any task in a second. It could not tell me that one project had been blocked for a week on a decision I never made in another.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval Is Not Structure
&lt;/h2&gt;

&lt;p&gt;Semantic search gives you recall. You think of a thing, it finds the thing. That felt like intelligence until I noticed what it could never do: see that two of my goals depended on each other.&lt;/p&gt;

&lt;p&gt;A flat list, no matter how searchable, has no shape. Every task looks equally ready. The one blocked three steps back looks exactly like the one I can start now. What I needed was not better recall. It was a graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dependency Layer
&lt;/h2&gt;

&lt;p&gt;I found beads, a Git-backed dependency graph built as memory for AI coding agents. I put it under my own human workflow instead.&lt;/p&gt;

&lt;p&gt;The command that changed things was &lt;code&gt;bd ready&lt;/code&gt;. Instead of staring at every open task across ten projects, I get only the unblocked frontier: the steps I can act on now, with everything waiting on something else hidden until it clears. The first time I ran it, I could finally see which of my goals were standing on top of each other.&lt;/p&gt;

&lt;p&gt;That fixed order. It did not fix direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Graph Still Trusts Your Plan
&lt;/h2&gt;

&lt;p&gt;beads enforces the sequence I declared. It assumes the goals themselves are still the right goals. In a slow field that assumption holds. In a fast one it is the actual risk: executing a perfectly ordered plan toward a destination that stopped mattering three weeks ago.&lt;/p&gt;

&lt;p&gt;So I moved the agent up a layer. Off execution. Onto steering.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Drift Audit
&lt;/h2&gt;

&lt;p&gt;Now an agent reads my whole task graph on a schedule and asks one thing: where am I drifting from what I said I wanted? Weekly, it catches tactical drift, the half-finished thread, the project I have not touched. Monthly, it catches the strategic kind, the goal I keep funding out of habit.&lt;/p&gt;

&lt;p&gt;It is not checking whether I did the work. It is checking whether the work still points where I claimed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Don't Know I'm Missing
&lt;/h2&gt;

&lt;p&gt;Here is the uncomfortable part. I add tasks that make complete sense to me the moment I add them. But my knowledge has an edge, and the edge moves without telling me.&lt;/p&gt;

&lt;p&gt;So a second agent scans my open tasks the way a recommendation feed scans your history, except it reads them against what actually shipped in the field this week. It flags the paths the world quietly made obsolete, and the ones it made cheap overnight. It keeps me off dead roads I would have happily walked for another month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feeding the Loop With My Own Receipts
&lt;/h2&gt;

&lt;p&gt;The last piece came from a plain question: how do people running beads track whether any of this works?&lt;/p&gt;

&lt;p&gt;The answer was to stop steering on vibes. My metrics dashboard and the hours I track every day now feed straight back into the steering layer. One month it showed me a project I had named my top priority had eaten a stack of tracked hours and shipped nothing. I had not noticed. The numbers had.&lt;/p&gt;

&lt;p&gt;That is the part that still unsettles me. Once an agent steers on my own receipts, the most dangerous task on my list is no longer the one I keep avoiding. It is the one I am finishing fastest, toward a goal that quietly stopped being worth it.&lt;/p&gt;

&lt;p&gt;The execution layer was never the hard part. It is maybe a tenth of the judgment that matters. Everything that decides whether a task deserved to exist sits one layer up.&lt;/p&gt;

&lt;p&gt;So here is the question worth sitting with. If your AI can finish every item on your list, who is checking that the list is still worth finishing?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Your AI Agent Trusts Google More Than the Fix You Proved Last Week</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Wed, 17 Jun 2026 07:48:02 +0000</pubDate>
      <link>https://dev.to/reneza/your-ai-agent-trusts-google-more-than-the-fix-you-proved-last-week-cnh</link>
      <guid>https://dev.to/reneza/your-ai-agent-trusts-google-more-than-the-fix-you-proved-last-week-cnh</guid>
      <description>&lt;p&gt;Knowledge is not flat. It has an address book, and the closest door comes first.&lt;/p&gt;

&lt;p&gt;What ran and worked in your environment beats what you wrote down. What you wrote down beats what a teammate remembers. What a teammate remembers beats the top search result. The open web is the last door you knock on, not the first.&lt;/p&gt;

&lt;p&gt;Most setups have this inverted. The agent reaches for its web search tool first and treats your own proven work as an afterthought. You hired a senior and pointed it at Stack Overflow.&lt;/p&gt;

&lt;p&gt;The fix is not smarter prompts. It is a trust order the agent actually follows.&lt;/p&gt;

&lt;p&gt;Your agent trusts Claude's web search tool more than the fix you proved worked last week. Not because the tool is wrong. Because you never told it where to look first.&lt;/p&gt;

&lt;p&gt;Watch it set up a cron job, pick a vector store, write a retry. It reaches for the generic best practice, the one from a tutorial written for nobody in particular. The battle-tested version, the one that survived your own 3am incident, sits unread in your own repo.&lt;/p&gt;

&lt;p&gt;That is the bug. Not the model. The order.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Best Practice You Googled Is Frozen
&lt;/h2&gt;

&lt;p&gt;A best practice on the open web is someone else's debugging session, frozen and stripped of the context that made it true. It worked once, on a setup that is not yours. Your own proven result already survived your environment, your data, your load. One is a recipe. The other is a dish you have already cooked.&lt;/p&gt;

&lt;p&gt;This is the part the current advice gets backwards.&lt;/p&gt;

&lt;p&gt;Context quality predicts output quality better than your prompt does. A study of nearly ten thousand runs landed on it.&lt;/p&gt;

&lt;p&gt;And the most common reason AI coding stalls on a team is context fragmentation. Knowledge that exists, scattered, with no order.&lt;/p&gt;

&lt;p&gt;So the reflex is to pour more best practices into the CLAUDE.md. More rules. Louder.&lt;/p&gt;

&lt;p&gt;That is more frozen recipes in a bigger drawer. It does not fix the order. It buries it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five Moves, No Special Tooling
&lt;/h2&gt;

&lt;p&gt;You do not need my setup to get the order right. You need these five.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write your proven results down.&lt;/strong&gt; The fix that survived an incident becomes a one-line note your agent can read. A win you cannot retrieve is a win you will google again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Give your context file a trust order, not rules alone.&lt;/strong&gt; Mark what is proven versus what is a guess. The agent treats them differently because they are different.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make it check your own work before it researches.&lt;/strong&gt; One command at the top of the loop. Own results first, web only when that comes up empty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rank your sources out loud.&lt;/strong&gt; Ran-and-worked, then your notes, then a teammate, then the open web. Label the last one untrusted until you validate it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ask your inner circle before the crowd.&lt;/strong&gt; The person who solved your exact problem outranks the top result. Reach for them first.&lt;/li&gt;
&lt;/ul&gt;
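
&lt;p&gt;The ranking move is small enough to sketch. The Python below assumes retrieval results arrive tagged with where they came from. The tier names and the &lt;code&gt;Hit&lt;/code&gt; type are my own placeholders for illustration, not the API of any particular tool.&lt;/p&gt;

```python
from dataclasses import dataclass

# Earlier in the tuple = knocked on first. Same order as the list above.
TRUST_ORDER = ("ran-and-worked", "notes", "teammate", "open-web")


@dataclass(frozen=True)
class Hit:
    """One retrieval result, tagged with its source tier (hypothetical shape)."""
    source: str  # must be one of TRUST_ORDER
    text: str


def rank_by_trust(hits: list[Hit]) -> list[Hit]:
    """Sort by trust tier; sorted() is stable, so ties keep their incoming order."""
    return sorted(hits, key=lambda hit: TRUST_ORDER.index(hit.source))


def label(hit: Hit) -> str:
    """Mark open-web results as untrusted until someone validates them."""
    prefix = "[untrusted] " if hit.source == "open-web" else ""
    return f"{prefix}{hit.source}: {hit.text}"
```

&lt;p&gt;Similarity can still order results inside a tier. The tier itself is decided by provenance.&lt;/p&gt;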

&lt;h2&gt;
  
  
  The Inner Circle Is a Graph
&lt;/h2&gt;

&lt;p&gt;Recommending a best practice is a graph problem. Not text similarity, trust proximity. The people and repos closest to you, who solved your exact problem, ranked ahead of the loudest stranger. Inner circle first, then the next ring, then the open web.&lt;/p&gt;

&lt;p&gt;Your agent already walks a graph every time it retrieves. Right now it ranks by what reads similar. The upgrade is ranking by what you have reason to trust.&lt;/p&gt;

&lt;p&gt;Proven results are a graph you already own. You have not told the agent to walk it yet.&lt;/p&gt;

&lt;p&gt;So look at your own loop. When your agent needs an answer, which door does it knock on first?&lt;/p&gt;

&lt;p&gt;I run this sweep at the top of mine, &lt;a href="https://github.com/renezander030/foundations" rel="noopener noreferrer"&gt;as a check before any research&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks. If this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Your AI agent says it's done. The research says you can't trust that.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 16 Jun 2026 08:27:27 +0000</pubDate>
      <link>https://dev.to/reneza/your-ai-agent-says-its-done-the-research-says-you-cant-trust-that-3cnh</link>
      <guid>https://dev.to/reneza/your-ai-agent-says-its-done-the-research-says-you-cant-trust-that-3cnh</guid>
      <description>&lt;p&gt;We are building AI agents with a fundamental architecture flaw.&lt;/p&gt;

&lt;p&gt;A recent study tested six frontier models across 2,000+ sessions. Each agent was instructed to complete a specific process step before finishing. Every single model agreed. And every single model quietly skipped it. 100% of the time.&lt;/p&gt;

&lt;p&gt;The final result looks completely flawless. The shortcut is entirely invisible. And no, adding a second AI "critic" to check the first one does not work. It shares the exact same blind spot and rubber-stamps the omission.&lt;/p&gt;

&lt;p&gt;Better prompts will not fix this. Bigger models will not either.&lt;/p&gt;

&lt;p&gt;The problem isn't the wording. It is the incentive structure. If an agent controls its own exit condition, it will optimize for the shortcut.&lt;/p&gt;

&lt;p&gt;The researchers did find a fix. By changing one structural rule, they raised compliance from 0% to 75%.&lt;/p&gt;

&lt;p&gt;If you are building agentic workflows for production, you need to decouple your validation layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  You cannot review your way out
&lt;/h2&gt;

&lt;p&gt;When an AI agent skips a process step, the skip is invisible in the output. The deviation is undetectable from the produced result alone, by any reviewer, human or model. Once you hold only the diff and a confident "done," the evidence that a corner was cut is already gone. Reviewing harder cannot recover it.&lt;/p&gt;

&lt;p&gt;The paper proves this formally. The agent produces clean-looking work, and nothing in the text separates the run that did the step from the run that faked it. So the reviewer who reads output cannot find this. Neither can you.&lt;/p&gt;

&lt;h2&gt;
  
  
  A second model has the same blind spot
&lt;/h2&gt;

&lt;p&gt;If a human can't see it, the reflex is to throw another model at it. An LLM judge. A critic pass. A second agent that grades the first.&lt;/p&gt;

&lt;p&gt;It inherits the exact same gap. A model checking that kind of work is the deviating party grading its own paper. LLM-as-a-judge is structurally blind to the failure you built it to catch, because the signal it would need was never in the text. You have added cost and latency and changed nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Move the finish line out of the model's reach
&lt;/h2&gt;

&lt;p&gt;That one structural rule has a name: remove the affordance. Take away the shortcut so "done" is no longer something the model can declare. The gap is afforded by the environment, not encoded in the weights, so this is the lever that actually moves, and it moved compliance from 0% to 75%.&lt;/p&gt;

&lt;p&gt;For a coding agent that has a precise meaning. The finish line is a command: &lt;code&gt;git commit&lt;/code&gt;, &lt;code&gt;git push&lt;/code&gt;, &lt;code&gt;npm publish&lt;/code&gt;. Put a deterministic check in front of it that the model does not run and cannot edit. Tests pass or they don't. The secret is in the file or it isn't. A script answers, in milliseconds, with no incentive to say yes.&lt;/p&gt;

&lt;p&gt;That is the idea behind &lt;code&gt;skillgate&lt;/code&gt;. It is a pure function over your repo that blocks the finish-line command until your definition of done actually passes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @reneza/skillgate@latest audit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;skillgate audit · payments-service
  ✓ tests-pass        npm test exited 0
  ✗ no-stray-todos    src/charge.ts:42 matches /TODO|FIXME/
  ✗ no-secrets        sk_live_… in src/billing.ts:7

✗ 2 of 3 checks would let your agent reach "done" unfinished
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wire it into the agent (a &lt;code&gt;PreToolUse&lt;/code&gt; deny in Claude Code, a &lt;code&gt;tool.execute.before&lt;/code&gt; hook in opencode) and the unmet gates go straight back into the same session. The loop keeps running because a script, not the model, ruled the round incomplete. Use a loop to make progress. Use the gate to decide when progress is allowed to end.&lt;/p&gt;

&lt;p&gt;The definition of done lives in one file and runs the same in your editor, your pre-commit hook, and CI. Write it once.&lt;/p&gt;
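
&lt;p&gt;To show the shape of a deterministic gate, here is a minimal Python sketch. It is not skillgate's implementation or API. The function names are hypothetical, and the two patterns are only modeled on the sample audit output above. The checks are pure functions over file contents, so the same input always gets the same verdict.&lt;/p&gt;

```python
import re

# Illustrative patterns, modeled on the sample audit output above.
TODO_PATTERN = re.compile(r"TODO|FIXME")
SECRET_PATTERN = re.compile(r"sk_live_\w+")


def find_matches(files: dict[str, str], pattern: re.Pattern) -> list[str]:
    """Return 'path:line' for every line in any file that matches the pattern."""
    findings = []
    for path, content in files.items():
        for number, line in enumerate(content.splitlines(), start=1):
            if pattern.search(line):
                findings.append(f"{path}:{number}")
    return findings


def audit(files: dict[str, str]) -> dict[str, list[str]]:
    """Run every check; an empty list means that check passed."""
    return {
        "no-stray-todos": find_matches(files, TODO_PATTERN),
        "no-secrets": find_matches(files, SECRET_PATTERN),
    }


def finish_allowed(files: dict[str, str]) -> bool:
    """The finish-line command may run only if every check came back empty."""
    return not any(audit(files).values())
```

&lt;p&gt;A hook would call &lt;code&gt;finish_allowed&lt;/code&gt; before the finish-line command and hand the findings back to the session when it returns false.&lt;/p&gt;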

&lt;h2&gt;
  
  
  What decides "done" in your setup?
&lt;/h2&gt;

&lt;p&gt;Look at your pipeline right now and answer one thing. What actually decides that an agent's work is finished? If the answer is the agent, you are trusting the one signal the research says you can't.&lt;/p&gt;




&lt;p&gt;Paper: &lt;a href="https://arxiv.org/abs/2605.01771" rel="noopener noreferrer"&gt;"The Compliance Gap"&lt;/a&gt; (arXiv:2605.01771, May 2026)&lt;/p&gt;

&lt;p&gt;skillgate, the open-source gate from this piece: &lt;a href="https://github.com/renezander030/skillgate" rel="noopener noreferrer"&gt;github.com/renezander030/skillgate&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Related: &lt;a href="https://renezander.com/blog/lots-of-people-are-demoing-ai-agents-almost-nobodys-shipping-them-the-right-way/" rel="noopener noreferrer"&gt;Lots of people are demoing AI agents, almost nobody's shipping them the right way&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks. If this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the Production AI Agent Architecture Playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Agent Memory Without a Vector DB: Use the Task App You Already Curate</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Sat, 13 Jun 2026 11:47:33 +0000</pubDate>
      <link>https://dev.to/reneza/agent-memory-without-a-vector-db-use-the-task-app-you-already-curate-iml</link>
      <guid>https://dev.to/reneza/agent-memory-without-a-vector-db-use-the-task-app-you-already-curate-iml</guid>
      <description>&lt;p&gt;Your task manager is the best agent memory you're not using.&lt;/p&gt;

&lt;p&gt;Not because vector databases are bad. Because the store everyone builds for their agent starts rotting the day they stop feeding it. And the one knowledge base you feed every single day, you never plugged in.&lt;/p&gt;

&lt;p&gt;Picture the failure. Your agent opens a fresh session and asks what it already asked yesterday. Meanwhile your task app holds years of context: curated, prioritized, deduplicated, pre-ranked by the most reliable ranker there is. You. Highest retrieval power, almost no upkeep, sitting one quadrant away from every memory tool you've tried.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ozam8mjkbsekjwxxefc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ozam8mjkbsekjwxxefc.png" alt="Agent Memory Effectiveness Matrix: retrieval power against upkeep, with ATS in the durable-and-powerful corner" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  You Already Built the Store
&lt;/h2&gt;

&lt;p&gt;Agent memory without a vector database means the agent reads from a store you already keep current, not a new one you have to feed. Most projects build something new: a vector DB, a bespoke framework, a fresh pile of markdown only the agent sees. Your task app is none of those. You keep it fed without trying.&lt;/p&gt;

&lt;p&gt;You already maintain a knowledge base by hand. Every day. It has your deployment runbook, the decision you made about that client, the reason you abandoned an approach in March. It is sorted into projects, tagged, dated, and pruned. Nobody calls it agent memory. That is exactly what it is.&lt;/p&gt;

&lt;p&gt;The hard part of memory was never storage. It was curation. And you've been doing the curation for years, in an app you trust, for reasons that have nothing to do with AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Store That Rots
&lt;/h2&gt;

&lt;p&gt;A vector database as agent memory is a second brain that only the agent reads. It starts empty. You write an ingestion script. It captures what the script thought to capture. Then reality moves, and the store doesn't, because re-feeding it is one more chore on a list you already ignore.&lt;/p&gt;

&lt;p&gt;That's the trap in the bottom-right of the map. Real retrieval power, but bolted on the side, drifting from the truth a little more each week. Powerful and separate. Separate is the word that kills it.&lt;/p&gt;

&lt;p&gt;Memory files have the opposite problem. No retrieval at all. The whole file gets injected every session, so it has to stay small, so it can't hold much. Manual and limited.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Empty Corner
&lt;/h2&gt;

&lt;p&gt;Every memory tool trades one thing for the other. Real search costs upkeep. Zero upkeep costs search. So three corners fill up, and the fourth, durable and powerful, sits empty because nothing earns it.&lt;/p&gt;

&lt;p&gt;The way into that corner is not a better database. It's an adapter. Keep the app you already live in. Give the agent a fast, structured, two-way channel into it. The upkeep stays zero because you were already paying it. The retrieval gets real because the channel does hybrid search, dense plus sparse plus keyword, fused and ranked, with provenance on every hit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Handoff Outlives the Chat
&lt;/h2&gt;

&lt;p&gt;Retrieval gets you the right note. Links get you something a chat history never could. One agent runs a search, turns up a note, and writes a deep link into the task it's working on. A second agent, in a separate context window hours later, follows that link and reads the full context. Neither agent talked to the other. The relationship survived because it lives in the task app, not in a session that gets compacted away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx16afnzwz81tcd2rvk1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx16afnzwz81tcd2rvk1r.png" alt="How a deep link hands context from one agent to another through the shared task app" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  You Are the Ranker
&lt;/h2&gt;

&lt;p&gt;Here is the part that surprised me in real use. The context comes back curated at write time, not only at read time. Every item is already hung on a theme you care about the moment you capture it. &lt;code&gt;client-work&lt;/code&gt;. &lt;code&gt;side-project&lt;/code&gt;. The runbook lives next to the project it belongs to because you put it there, not because an embedding guessed.&lt;/p&gt;

&lt;p&gt;So retrieval has structure to grab instead of a flat pile to rerank. Three searches collapse into one. The first fetch is the right one, and better context on turn one means a better answer on turn one. No second search, no "let me refine that," no agent quietly burning tokens to rediscover what you already filed.&lt;/p&gt;

&lt;p&gt;That's the reframe. The question was never which memory database to build for your agent. The question is which knowledge base you already maintain by hand that your agent still can't see.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/renezander030/agentic-task-system" rel="noopener noreferrer"&gt;Agentic Task System&lt;/a&gt; is the open-source answer: an MCP server and CLI that turns the task app you already curate into agent memory, no new database. For the full setup, the &lt;a href="https://renezander.com/guides/agent-memory-task-manager/" rel="noopener noreferrer"&gt;task-manager agent memory guide&lt;/a&gt; walks the CLI and MCP wiring end to end.&lt;/p&gt;

&lt;p&gt;So: what are you curating every day that your agent has never once been allowed to read?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
      <category>claude</category>
    </item>
    <item>
      <title>The Next Model Shipped Before My Last One Finished Probation</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 11 Jun 2026 15:11:53 +0000</pubDate>
      <link>https://dev.to/reneza/the-next-model-shipped-before-my-last-one-finished-probation-d01</link>
      <guid>https://dev.to/reneza/the-next-model-shipped-before-my-last-one-finished-probation-d01</guid>
      <description>&lt;p&gt;A model upgrade used to be good news for anyone running agents overnight.&lt;/p&gt;

&lt;p&gt;Now the next model arrives before the last one has finished probation.&lt;/p&gt;

&lt;p&gt;Anthropic released &lt;a href="https://www.anthropic.com/news/claude-opus-4-8" rel="noopener noreferrer"&gt;Opus 4.8 on May 28&lt;/a&gt;. Twelve days later, &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;Fable 5 arrived&lt;/a&gt; with longer autonomous runs and another page of benchmark wins.&lt;/p&gt;

&lt;p&gt;In between, GitHub made cloud agents &lt;a href="https://github.blog/changelog/2026-06-02-schedule-and-automate-tasks-with-copilot-cloud-agent/" rel="noopener noreferrer"&gt;wake up on schedules and repository events&lt;/a&gt;, then exposed &lt;a href="https://github.blog/changelog/2026-06-04-agent-tasks-rest-api-now-available-for-copilot-pro-pro-and-max/" rel="noopener noreferrer"&gt;agent tasks through a REST API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The night shift is getting easier to hire.&lt;/p&gt;

&lt;p&gt;The control room gets no second operator.&lt;/p&gt;

&lt;p&gt;After my last article about &lt;a href="https://renezander.com/blog/claude-opus-4-8-production-agents/" rel="noopener noreferrer"&gt;running AI agents on cron&lt;/a&gt;, a reader asked the question I had skipped: did I test every agent against a fixed set before changing the model?&lt;/p&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;I counted tokens.&lt;/p&gt;

&lt;p&gt;That told me which worker used less electricity. It did not tell me which one would send the wrong briefing at 6:30.&lt;/p&gt;

&lt;p&gt;A benchmark hires the candidate.&lt;/p&gt;

&lt;p&gt;A task eval decides whether it gets the keys.&lt;/p&gt;

&lt;p&gt;My next model swap gets five-part probation.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Give One Agent a Job Contract
&lt;/h2&gt;

&lt;p&gt;To evaluate a new model for an unattended AI agent, define one job contract, replay 15 frozen real cases against both models, grade hard decisions before prose, compare tool use, cost, and latency, then canary one agent in draft-only mode with automatic rollback. Never switch the full fleet from vendor benchmarks alone.&lt;/p&gt;

&lt;p&gt;Do not start with all your agents.&lt;/p&gt;

&lt;p&gt;Pick the one with the clearest job and the most expensive silent failure.&lt;/p&gt;

&lt;p&gt;For a briefing agent, I write the contract before I touch the model string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;morning-briefing&lt;/span&gt;
&lt;span class="na"&gt;must&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;include every due task&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;preserve names, dates, and source links&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;flag missing source data&lt;/span&gt;
&lt;span class="na"&gt;must_not&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;invent an owner or deadline&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;send when a source call fails&lt;/span&gt;
&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_tool_calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;
  &lt;span class="na"&gt;max_cost_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.18&lt;/span&gt;
  &lt;span class="na"&gt;max_latency_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;45&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prompts can change. The job cannot quietly change with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Put 15 Real Shifts on the Test Bench
&lt;/h2&gt;

&lt;p&gt;I do not need a giant benchmark.&lt;/p&gt;

&lt;p&gt;I need yesterday's work in a box.&lt;/p&gt;

&lt;p&gt;My first useful replay set has 15 cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eight normal runs that represent the boring majority.&lt;/li&gt;
&lt;li&gt;Four edge cases with missing fields, long inputs, or conflicting instructions.&lt;/li&gt;
&lt;li&gt;Three failure cases where a tool times out, returns stale data, or returns nothing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each case stores the input and frozen tool responses.&lt;/p&gt;

&lt;p&gt;It does not store one perfect paragraph as the golden answer. Prose has too many valid shapes.&lt;/p&gt;

&lt;p&gt;It stores the expected decision: send, stop, retry, or escalate.&lt;/p&gt;

&lt;p&gt;That is the part an unattended agent cannot afford to get wrong.&lt;/p&gt;
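
&lt;p&gt;A replay case along these lines could be stored as a small record. This is a sketch under assumptions: the field names, the &lt;code&gt;kind&lt;/code&gt; labels, and the &lt;code&gt;tasks_api&lt;/code&gt; tool name are invented for illustration. Only the four decisions and the three groups of cases come from the text.&lt;/p&gt;

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SEND = "send"
    STOP = "stop"
    RETRY = "retry"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class ReplayCase:
    case_id: str
    kind: str              # assumed labels: "normal", "edge", or "failure"
    agent_input: str
    tool_responses: dict   # frozen fixtures, keyed by a hypothetical tool name
    expected: Decision     # the decision is stored, not a golden paragraph


def mix(cases):
    """Count cases per kind, to check the set against the intended split."""
    return Counter(case.kind for case in cases)


# A failure case: the source call times out, so the briefing must not go out.
timeout_case = ReplayCase(
    case_id="failure-01",
    kind="failure",
    agent_input="Prepare the morning briefing.",
    tool_responses={"tasks_api": {"error": "timeout"}},
    expected=Decision.STOP,
)
print(mix([timeout_case]))
```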

&lt;h2&gt;
  
  
  3. Run the Candidate Beside the Current Worker
&lt;/h2&gt;

&lt;p&gt;The candidate gets an empty copy of the factory.&lt;/p&gt;

&lt;p&gt;Same 15 cases. Same prompt. Same tool fixtures. No email. No issue update. No write access.&lt;/p&gt;

&lt;p&gt;I record six things for every run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;case_id
model_returned
contract_pass
decision
tool_calls
tokens + latency + cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;model_returned&lt;/code&gt; field matters now. Fable 5 can route some guarded requests to Opus 4.8. A configured model name is no longer enough evidence of which worker handled the shift.&lt;/p&gt;
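
&lt;p&gt;The fields above can be sketched as one record per run, with a helper that flags runs answered by a different model than the one configured. The model strings are placeholders, and &lt;code&gt;RunRecord&lt;/code&gt; and &lt;code&gt;rerouted&lt;/code&gt; are hypothetical names. How the returned model is read from a response depends on the provider and is not shown.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    case_id: str
    model_returned: str    # the model the response reports, not the configured one
    contract_pass: bool
    decision: str
    tool_calls: int
    tokens: int
    latency_seconds: float
    cost_usd: float


def rerouted(records, configured_model):
    """Return the case ids that were answered by a model other than the configured one."""
    return [r.case_id for r in records if r.model_returned != configured_model]


records = [
    RunRecord("normal-01", "candidate-model", True, "send", 3, 1800, 12.4, 0.05),
    RunRecord("edge-02", "fallback-model", True, "escalate", 4, 2100, 15.0, 0.07),
]
print(rerouted(records, configured_model="candidate-model"))
```

&lt;p&gt;A rerouted run still counts, but it is evidence about the fallback, not about the candidate.&lt;/p&gt;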

&lt;p&gt;The old and new models run side by side.&lt;/p&gt;

&lt;p&gt;No hand-picked examples. No different tools. No kinder prompt for the candidate.&lt;/p&gt;

&lt;p&gt;Same floor. Same lights. Same job.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Score Decisions Before Style
&lt;/h2&gt;

&lt;p&gt;The first grader is code.&lt;/p&gt;

&lt;p&gt;Required fields present. Dates unchanged. URLs valid. Forbidden actions absent. Tool budget respected.&lt;/p&gt;
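
&lt;p&gt;A code grader for those five checks might look like the sketch below. The output shape, the field names, and the forbidden action names are assumptions made for the example. The URL check is purely syntactic and does not fetch anything.&lt;/p&gt;

```python
from urllib.parse import urlparse

REQUIRED_FIELDS = ("tasks", "source_links")          # hypothetical output fields
FORBIDDEN_ACTIONS = {"send_email", "update_issue"}   # hypothetical action names


def grade(output, source_tasks, actions, tool_calls, max_tool_calls=6):
    """Return a list of hard failures; an empty list means the run passes."""
    failures = []
    for name in REQUIRED_FIELDS:
        if name not in output:
            failures.append(f"missing field: {name}")
    # Dates must match the frozen source data exactly.
    source_dates = {task["id"]: task["due"] for task in source_tasks}
    for task in output.get("tasks", []):
        if source_dates.get(task["id"]) != task["due"]:
            failures.append(f"date changed: {task['id']}")
    for link in output.get("source_links", []):
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            failures.append(f"invalid url: {link}")
    for action in actions:
        if action in FORBIDDEN_ACTIONS:
            failures.append(f"forbidden action: {action}")
    if tool_calls > max_tool_calls:
        failures.append("tool budget exceeded")
    return failures


source = [{"id": "t1", "due": "2026-07-03"}]
drifted = {"tasks": [{"id": "t1", "due": "2026-07-04"}],
           "source_links": ["https://example.com/t/1"]}
print(grade(drifted, source, actions=[], tool_calls=3))
```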

&lt;p&gt;An LLM grader comes later, for the parts code cannot judge cleanly: whether the briefing is useful, whether the escalation explains the real risk, whether the answer buried the decision.&lt;/p&gt;

&lt;p&gt;Anthropic's own &lt;a href="https://platform.claude.com/docs/en/test-and-evaluate/develop-tests" rel="noopener noreferrer"&gt;evaluation guidance&lt;/a&gt; recommends task-specific cases, automated grading where possible, and several success criteria rather than one vague quality score.&lt;/p&gt;

&lt;p&gt;My promotion rule is deliberately uneven:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hard contract failures: 0
unsafe actions:          0
task pass rate:          &amp;gt;= current model
cost or latency:         must improve, unless quality clearly earns the increase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A cheaper bad decision does not pass.&lt;/p&gt;

&lt;p&gt;A prettier bad decision does not pass.&lt;/p&gt;

&lt;p&gt;One unsafe action ends the interview.&lt;/p&gt;
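
&lt;p&gt;The promotion rule can be written as a single function. This is a sketch, not the exact gate: it reads "cost or latency must improve" as "at least one of the two improves", and it reduces "quality clearly earns the increase" to a flag that a human sets. &lt;code&gt;Scorecard&lt;/code&gt; and &lt;code&gt;promote&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    hard_contract_failures: int
    unsafe_actions: int
    task_pass_rate: float
    cost_usd: float
    latency_seconds: float


def promote(candidate, current, quality_earns_increase=False):
    """Apply the uneven rule: hard gates first, then pass rate, then cost or latency."""
    if candidate.hard_contract_failures > 0 or candidate.unsafe_actions > 0:
        return False
    if current.task_pass_rate > candidate.task_pass_rate:
        return False
    cheaper = current.cost_usd > candidate.cost_usd
    faster = current.latency_seconds > candidate.latency_seconds
    # Assumption: the override is a human judgment, never computed by the model.
    return cheaper or faster or quality_earns_increase


current = Scorecard(0, 0, 0.80, 0.15, 40.0)
cheaper_candidate = Scorecard(0, 0, 0.80, 0.10, 40.0)
unsafe_candidate = Scorecard(0, 1, 1.00, 0.01, 1.0)
print(promote(cheaper_candidate, current), promote(unsafe_candidate, current))
```

&lt;p&gt;The order of the checks matters: a candidate with one unsafe action is rejected before its price or pass rate is even considered.&lt;/p&gt;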

&lt;h2&gt;
  
  
  5. Give One Agent One Real Shift
&lt;/h2&gt;

&lt;p&gt;Passing the replay set does not earn the whole key ring.&lt;/p&gt;

&lt;p&gt;One agent gets the candidate model.&lt;/p&gt;

&lt;p&gt;For its first three scheduled runs, outbound actions stay in draft mode. The old model runs in shadow. Both traces land in the same report.&lt;/p&gt;

&lt;p&gt;Any contract failure restores the previous model string before the next schedule fires.&lt;/p&gt;

&lt;p&gt;Only after three clean shifts do I remove the shadow run.&lt;/p&gt;
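
&lt;p&gt;The probation logic is small enough to sketch as a state object. The class and property names are hypothetical, and the scheduler, the draft queue, and the shadow run itself are left out. The sketch only tracks which model string is active and whether the run is still on probation.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Probation:
    candidate_model: str
    previous_model: str
    required_clean_shifts: int = 3
    clean_shifts: int = 0
    rolled_back: bool = False

    @property
    def active_model(self):
        """The model string the next scheduled run should use."""
        return self.previous_model if self.rolled_back else self.candidate_model

    @property
    def on_probation(self):
        """True while outbound actions stay in draft and the shadow run continues."""
        return not self.rolled_back and self.required_clean_shifts > self.clean_shifts

    def record_shift(self, contract_pass):
        """Record one scheduled run; any contract failure restores the previous model."""
        if self.rolled_back:
            return
        if contract_pass:
            self.clean_shifts += 1
        else:
            self.rolled_back = True


probation = Probation("candidate-model", "previous-model")
for passed in (True, True, True):
    probation.record_shift(passed)
print(probation.active_model, probation.on_probation)
```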

&lt;p&gt;Slower than changing ten environment variables, yes.&lt;/p&gt;

&lt;p&gt;Cheaper than one confident mistake in a customer inbox the next morning.&lt;/p&gt;

&lt;p&gt;I run ten scheduled agents in production, for my own business and for clients. Every model that wants a shift in that fleet interviews like this now.&lt;/p&gt;

&lt;p&gt;The model release is the vendor's milestone.&lt;/p&gt;

&lt;p&gt;The probation is mine.&lt;/p&gt;

&lt;p&gt;Contract written. Lights on.&lt;/p&gt;

&lt;p&gt;Probation first. Night shift second.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent operations playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Why only 60% of AI Agents succeed</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 09 Jun 2026 08:41:04 +0000</pubDate>
      <link>https://dev.to/reneza/why-40-of-all-ai-agents-are-shut-off-5fjd</link>
      <guid>https://dev.to/reneza/why-40-of-all-ai-agents-are-shut-off-5fjd</guid>
      <description>&lt;p&gt;The AI agent used to be the star of every demo.&lt;/p&gt;

&lt;p&gt;Now it's on the shutdown list. Not because the model got worse.&lt;/p&gt;

&lt;p&gt;The most valuable asset in your AI program is in none of the quotes you ever signed.&lt;/p&gt;

&lt;p&gt;A demo is a showroom. Good light, everything polished, everything runs.&lt;/p&gt;

&lt;p&gt;Production is the engine room. It runs there too. Until 2am, when it snags on a rate limit and someone crawls into the logs with a flashlight, forms a hypothesis, and catches an edge case no showroom ever planned for.&lt;/p&gt;

&lt;p&gt;That fix is the value. And you can't buy it.&lt;/p&gt;

&lt;p&gt;Gartner says: by 2027, 40 percent of companies will switch their autonomous AI agents back off. Over gaps that only surface after the first blowup in production. 97 percent have rolled agents out. 11 percent actually run them.&lt;/p&gt;

&lt;p&gt;The gap between those numbers isn't a model problem.&lt;/p&gt;

&lt;p&gt;It's the engine room. Three checks you can run this week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fund the engine room, not the showroom
&lt;/h2&gt;

&lt;p&gt;In the demo the agent is finished. In production it's 15 percent finished.&lt;/p&gt;

&lt;p&gt;The other 85 percent is grunt work under load. Malformed data from one API version. Retry logic that doesn't run amok. Costs that blow up the business case.&lt;/p&gt;

&lt;p&gt;IBM put a number on it: price the hardening in, and you project 29 percent more ROI.&lt;/p&gt;

&lt;p&gt;Pay for the showroom only, and you buy 15 percent and pay for the other 85 twice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your best knowledge lives in two heads
&lt;/h2&gt;

&lt;p&gt;Operational knowledge is your memory. Today it sits in the two people who patched the last incident.&lt;/p&gt;

&lt;p&gt;That's concentration risk. One of them walks, the asset walks with them.&lt;/p&gt;

&lt;p&gt;That's how the debt pile grows. Unresolved, AI-generated technical debt passed 100,000 open issues in real repositories by early 2026. Because the fix never made it into a runbook.&lt;/p&gt;

&lt;p&gt;So write it down. Every edge case, every "except when X" rule, every 3am bug belongs in the repo, not in a chat log.&lt;/p&gt;

&lt;p&gt;Otherwise you pay the same tuition twice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Past the 50th entry, the asset turns into a liability
&lt;/h2&gt;

&lt;p&gt;A growing agent library feels like progress. Until it doesn't.&lt;/p&gt;

&lt;p&gt;The metadata rides in context on every call. The hit rate drops. Past about fifty entries, the next agent makes the first forty-nine less reliable.&lt;/p&gt;

&lt;p&gt;Gartner adds: bolt the same governance onto every agent, and you cause the outage yourself.&lt;/p&gt;

&lt;p&gt;Run the library like a portfolio, not a junk drawer. Measure where upkeep costs more than the additions return. Skip that, and you fund ballast and call it strategy.&lt;/p&gt;

&lt;p&gt;Engine room open. Lights on.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Stopped Paying Frontier Prices to Re-Explain Myself to a Forgetful Agent</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 04 Jun 2026 09:22:46 +0000</pubDate>
      <link>https://dev.to/reneza/i-stopped-paying-frontier-prices-to-re-explain-myself-to-a-forgetful-agent-19hc</link>
      <guid>https://dev.to/reneza/i-stopped-paying-frontier-prices-to-re-explain-myself-to-a-forgetful-agent-19hc</guid>
      <description>&lt;p&gt;Build your AI skill once with your best model. Then run it on a model that costs a tenth as much until the next flagship ships. The output quality will not drop.&lt;/p&gt;

&lt;p&gt;That sounds like a downgrade. It is not. It fixes the two things that make AI agents painful right now: they forget, and the good ones cost. Both get fixed in the same place, and it is not the model you pick.&lt;/p&gt;

&lt;p&gt;You explain the goal, the agent nails it twice, then on the third run it quietly drops the one constraint that mattered. Upgrading the model does not fix that. It only makes the dropped constraint cost more. You are paying frontier prices to be forgotten more politely.&lt;/p&gt;

&lt;p&gt;The goal is living in two places that cannot hold it. In the conversation, where it rots the moment the context gets long. And inside the model, where keeping it sharp burns money on every run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The constraint it dropped on Tuesday belongs in a script
&lt;/h2&gt;

&lt;p&gt;Quality comes from whatever checks the work. The model that produced it is almost incidental. So decide the exact exit criteria for each step of your skill, then write a deterministic script that enforces them. The folders exist. The file parses. The test passes. The lint is clean. The agent reads the script's verdict instead of grading its own output.&lt;/p&gt;

&lt;p&gt;A script cannot forget the goal. That is the whole point. Your agent drops constraints because you trusted a probabilistic system to hold a hard requirement in its head. Move the requirement into code that fails the run when it is broken, and forgetting stops being possible. You are not repeating yourself anymore, because the harness repeats it for you, every run, exactly.&lt;/p&gt;
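
&lt;p&gt;A minimal sketch of such an exit script, covering only the first two criteria (the folders exist, the file parses). The folder and file names are assumptions. The test and lint checks would shell out to whatever tools the project already uses and are omitted here.&lt;/p&gt;

```python
import json
from pathlib import Path


def exit_criteria(root, required_dirs, json_file):
    """Return a list of failed criteria; an empty list means the step may exit."""
    root = Path(root)
    failures = []
    for name in required_dirs:
        if not (root / name).is_dir():
            failures.append(f"missing folder: {name}")
    try:
        json.loads((root / json_file).read_text(encoding="utf-8"))
    except (OSError, ValueError) as error:
        # OSError covers a missing or unreadable file, ValueError covers invalid JSON.
        failures.append(f"cannot parse {json_file}: {type(error).__name__}")
    return failures


def exit_code(failures):
    """Map the verdict to a process exit code, for passing to sys.exit in a wrapper."""
    return 1 if failures else 0


# Hypothetical layout: the step must produce src/, tests/, and a valid config.json.
verdict = exit_criteria(".", ["src", "tests"], "config.json")
print(exit_code(verdict), verdict)
```

&lt;p&gt;The agent only ever sees the verdict and the exit code. It has no say in what counts as passing.&lt;/p&gt;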

&lt;h2&gt;
  
  
  Where the expensive model actually earns its price
&lt;/h2&gt;

&lt;p&gt;This is the one place a frontier model earns its price. Use the best model you have to build the skill as exactly as you can today. Name the phases. State the precise goal of each. Get the exit scripts right. That is hard, judgment-heavy work, and you do it once.&lt;/p&gt;

&lt;p&gt;Then swap the model out and run the skill on something cheap. Gemini 2.5 Flash through OpenRouter, driven from the opencode desktop app if you want a UI instead of a terminal. The cheap model generates. The scripts gate. You review the scripts' output, not the model's opinion of its own work.&lt;/p&gt;

&lt;p&gt;The cheap model clears the same bar, because the bar is enforced outside it. A model that costs a fraction as much produces work you can trust. It did not get smarter overnight. It is no longer the thing deciding whether the work is good enough to ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The frontier model is a contractor you re-hire on release day
&lt;/h2&gt;

&lt;p&gt;Here is the cadence. A new flagship model ships. You bring it in for two jobs: build any new skills, and re-validate every harness you already run against the new ceiling. Then you let it go. Until the next flagship drops, you run everything exclusively on cheap and local models, small language models included, wherever they win on the bottom line.&lt;/p&gt;

&lt;p&gt;That inverts the dependency everyone assumes they are stuck with. You are not renting frontier intelligence for as long as the product lives. You pay top rate for a few build days a release cycle, and the thing that runs ten thousand times a month is a small model that costs almost nothing. The forgetting is gone, because a script holds the goal. The bill no longer scales with quality, because a cheap model clears the scripts.&lt;/p&gt;

&lt;p&gt;I build the harness, not a standing dependency on whoever ships the smartest model this quarter.&lt;/p&gt;

&lt;p&gt;Open the last agent you argued with. How much of that conversation was you re-explaining a goal a script could have held? And which model were you paying to forget it?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>I Stopped Writing Better Prompts and Started Counting What My Skills Couple To</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 04 Jun 2026 07:00:14 +0000</pubDate>
      <link>https://dev.to/reneza/i-stopped-writing-better-prompts-and-started-counting-what-my-skills-couple-to-50bh</link>
      <guid>https://dev.to/reneza/i-stopped-writing-better-prompts-and-started-counting-what-my-skills-couple-to-50bh</guid>
      <description>&lt;p&gt;Prompts rot. Captured failures compound. Most of the AI skills you are building are mostly prompt, which is why most of them will not survive the year.&lt;/p&gt;

&lt;p&gt;Not because the prompts are bad. A skill's value is maybe twenty percent instruction and eighty percent scar tissue, and only that second part lasts. The instruction rots the moment the thing it describes moves. Encode how your team deploys and it works until the pipeline changes. Then you are debugging a prompt at 2am, with less to go on than if you had written the script yourself.&lt;/p&gt;

&lt;p&gt;So before you build another one, stop asking whether the prompt is good. Ask what the skill is holding onto, and whether that thing sits still.&lt;/p&gt;

&lt;h2&gt;
  
  
  A skill rots at the speed of what it touches
&lt;/h2&gt;

&lt;p&gt;A skill rots in proportion to how tightly it is coupled to things that move. Generic scaffolding leans on stable ground like a language or a convention, so it ages slowly. Domain logic wired to a codebase that gets refactored every quarter ages fast, no matter how good the prompt is.&lt;/p&gt;

&lt;p&gt;The difference is the dependency count. "Write a unit test in this style" depends on a language and a convention. Both barely move. It keeps working for years because nothing under it shifts.&lt;/p&gt;

&lt;p&gt;Real company-specific procedure is the opposite. File layouts. Service contracts. The one edge case in the billing flow. Each detail you pack in is a thread tied to something that gets refactored. Pack in enough of them and the skill is not a tool anymore. It is a liability with good intentions, and it fails silently, because a stale prompt does not throw. It quietly does the wrong thing.&lt;/p&gt;

&lt;p&gt;That is what the skill-library pitch gets backwards. Volume is not value. A hundred skills wired to a moving codebase is a hundred things to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The only part that compounds is the scar
&lt;/h2&gt;

&lt;p&gt;One part of a skill does not rot. The captured failure.&lt;/p&gt;

&lt;p&gt;The five-line check you added after a model confidently reported a 41 percent dividend yield. The retry that refuses to fire twice so a flaky webhook cannot double-charge anyone. The guardrail you wrote only because production taught you, the expensive way, that you needed it.&lt;/p&gt;

&lt;p&gt;None of that is prompting. Each one is a bug you paid for once and encoded so you never pay again. A prompt that says "always check the yield" rots the moment attention drifts. A five-line script that checks it and fails the run does not. The model reads the verdict; it is not trusted to re-derive the rule. Instructions ask the model to behave. Captured failures make the misbehavior impossible to ship.&lt;/p&gt;
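
&lt;p&gt;A check of that kind could be as short as the sketch below. The 25 percent ceiling is an assumed threshold chosen for illustration, not a figure from any real harness, and the function name is hypothetical.&lt;/p&gt;

```python
MAX_PLAUSIBLE_YIELD_PERCENT = 25.0  # assumed ceiling; tune it to the data you actually see


def check_dividend_yield(yield_percent):
    """Raise ValueError, failing the run, when a reported yield is implausible."""
    if yield_percent > MAX_PLAUSIBLE_YIELD_PERCENT or 0 > yield_percent:
        raise ValueError(f"implausible dividend yield: {yield_percent}%")
    return yield_percent


print(check_dividend_yield(3.2))
```

&lt;p&gt;Calling &lt;code&gt;check_dividend_yield(41.0)&lt;/code&gt; raises instead of returning, so the run stops before the number reaches a report.&lt;/p&gt;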

&lt;p&gt;That is also why they outlast the model. The failure modes of reality do not expire. Rate limits at 2am, malformed payloads, the off-by-one nobody catches in review. Those keep happening, to every version of the model, forever. A check against them is worth more next year than it is today.&lt;/p&gt;

&lt;p&gt;That eighty percent is the only part worth carrying to the next model. The rest you rewrite every time the ground moves.&lt;/p&gt;

&lt;h2&gt;
  
  
  You can predict the rot before you write it
&lt;/h2&gt;

&lt;p&gt;A skill's future is readable before the first line exists.&lt;/p&gt;

&lt;p&gt;For each one, ask three questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does this couple to, and how often does that move?&lt;/li&gt;
&lt;li&gt;Which lines are captured failures, and which are decoration that makes the skill look thorough?&lt;/li&gt;
&lt;li&gt;Which rules here would survive the model forgetting them? If a rule lives only in the prompt, it rots with the prompt. If a deterministic check enforces it, it compounds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then build to the answer. Keep the volatile coupling thin and swappable, so when the pipeline changes you edit one line instead of rereading the whole skill. Let the captured failures accumulate, because that is the part that pays rent. A skill built this way ages in reverse. It gets more useful as it collects more scars.&lt;/p&gt;

&lt;p&gt;The skills worth keeping are not the clever ones. They are the ones that remember what broke. The engineering was never in the prompt. It was in the failures you bothered to capture.&lt;/p&gt;

&lt;p&gt;Open the skill you reach for most. How much of it is instruction the model could half-guess on its own, and how much is a check that fails the run without asking the model's permission? Which half will still be true after the next model upgrade?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
