<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: René Zander</title>
    <description>The latest articles on DEV Community by René Zander (@reneza).</description>
    <link>https://dev.to/reneza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1138713%2Fa7d8635c-22db-4dec-b156-1fb07de64a8d.jpeg</url>
      <title>DEV Community: René Zander</title>
      <link>https://dev.to/reneza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/reneza"/>
    <language>en</language>
    <item>
      <title>Agent Memory Without a Vector DB: Use the Task App You Already Curate</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Sat, 13 Jun 2026 11:47:33 +0000</pubDate>
      <link>https://dev.to/reneza/agent-memory-without-a-vector-db-use-the-task-app-you-already-curate-iml</link>
      <guid>https://dev.to/reneza/agent-memory-without-a-vector-db-use-the-task-app-you-already-curate-iml</guid>
      <description>&lt;p&gt;Your task manager is the best agent memory you're not using.&lt;/p&gt;

&lt;p&gt;Not because vector databases are bad. Because the store everyone builds for their agent starts rotting the day they stop feeding it. And the one knowledge base you feed every single day, you never plugged in.&lt;/p&gt;

&lt;p&gt;Picture the failure. Your agent opens a fresh session and asks what it already asked yesterday. Meanwhile your task app holds years of context: curated, prioritized, deduplicated, pre-ranked by the most reliable ranker there is. You. Highest retrieval power, almost no upkeep, sitting one quadrant away from every memory tool you've tried.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ozam8mjkbsekjwxxefc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ozam8mjkbsekjwxxefc.png" alt="Agent Memory Effectiveness Matrix: retrieval power against upkeep, with ATS in the durable-and-powerful corner" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  You Already Built the Store
&lt;/h2&gt;

&lt;p&gt;Agent memory without a vector database means the agent reads from a store you already keep current, not a new one you have to feed. Most projects build something new: a vector DB, a bespoke framework, a fresh pile of markdown only the agent sees. Your task app is none of those. You keep it fed without trying.&lt;/p&gt;

&lt;p&gt;You already maintain a knowledge base by hand. Every day. It has your deployment runbook, the decision you made about that client, the reason you abandoned an approach in March. It is sorted into projects, tagged, dated, and pruned. Nobody calls it agent memory. That is exactly what it is.&lt;/p&gt;

&lt;p&gt;The hard part of memory was never storage. It was curation. And you've been doing the curation for years, in an app you trust, for reasons that have nothing to do with AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Store That Rots
&lt;/h2&gt;

&lt;p&gt;A vector database as agent memory is a second brain that only the agent reads. It starts empty. You write an ingestion script. It captures what the script thought to capture. Then reality moves, and the store doesn't, because re-feeding it is one more chore on a list you already ignore.&lt;/p&gt;

&lt;p&gt;That's the trap in the bottom-right of the map. Real retrieval power, but bolted on the side, drifting from the truth a little more each week. Powerful and separate. Separate is the word that kills it.&lt;/p&gt;

&lt;p&gt;Memory files have the opposite problem. No retrieval at all. The whole file gets injected every session, so it has to stay small, so it can't hold much. Manual and limited.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Empty Corner
&lt;/h2&gt;

&lt;p&gt;Every memory tool trades one thing for the other. Real search costs upkeep. Zero upkeep costs search. So three corners fill up, and the fourth, durable and powerful, sits empty because nothing earns it.&lt;/p&gt;

&lt;p&gt;The way into that corner is not a better database. It's an adapter. Keep the app you already live in. Give the agent a fast, structured, two-way channel into it. The upkeep stays zero because you were already paying it. The retrieval gets real because the channel does hybrid search, dense plus sparse plus keyword, fused and ranked, with provenance on every hit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Handoff Outlives the Chat
&lt;/h2&gt;

&lt;p&gt;Retrieval gets you the right note. Links get you something a chat history never could. One agent runs a search, turns up a note, and writes a deep link into the task it's working on. A second agent, in a separate context window hours later, follows that link and reads the full context. Neither agent talked to the other. The relationship survived because it lives in the task app, not in a session that gets compacted away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx16afnzwz81tcd2rvk1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx16afnzwz81tcd2rvk1r.png" alt="How a deep link hands context from one agent to another through the shared task app" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  You Are the Ranker
&lt;/h2&gt;

&lt;p&gt;Here is the part that surprised me in real use. The context comes back curated at write time, not only at read time. Every item is already hung on a theme you care about the moment you capture it. &lt;code&gt;client-work&lt;/code&gt;. &lt;code&gt;side-project&lt;/code&gt;. The runbook lives next to the project it belongs to because you put it there, not because an embedding guessed.&lt;/p&gt;

&lt;p&gt;So retrieval has structure to grab instead of a flat pile to rerank. Three searches collapse into one. The first fetch is the right one, and better context on turn one means a better answer on turn one. No second search, no "let me refine that," no agent quietly burning tokens to rediscover what you already filed.&lt;/p&gt;

&lt;p&gt;That's the reframe. The question was never which memory database to build for your agent. The question is which knowledge base you already maintain by hand that your agent still can't see.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/renezander030/agentic-task-system" rel="noopener noreferrer"&gt;Agentic Task System&lt;/a&gt; is the open-source answer: an MCP server and CLI that turns the task app you already curate into agent memory, no new database. For the full setup, the &lt;a href="https://renezander.com/guides/agent-memory-task-manager/" rel="noopener noreferrer"&gt;task-manager agent memory guide&lt;/a&gt; walks the CLI and MCP wiring end to end.&lt;/p&gt;

&lt;p&gt;So: what are you curating every day that your agent has never once been allowed to read?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds, AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
      <category>claude</category>
    </item>
    <item>
      <title>The Next Model Shipped Before My Last One Finished Probation</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 11 Jun 2026 15:11:53 +0000</pubDate>
      <link>https://dev.to/reneza/the-next-model-shipped-before-my-last-one-finished-probation-d01</link>
      <guid>https://dev.to/reneza/the-next-model-shipped-before-my-last-one-finished-probation-d01</guid>
      <description>&lt;p&gt;A model upgrade used to be good news for anyone running agents overnight.&lt;/p&gt;

&lt;p&gt;Now the next model arrives before the last one has finished probation.&lt;/p&gt;

&lt;p&gt;Anthropic released &lt;a href="https://www.anthropic.com/news/claude-opus-4-8" rel="noopener noreferrer"&gt;Opus 4.8 on May 28&lt;/a&gt;. Twelve days later, &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;Fable 5 arrived&lt;/a&gt; with longer autonomous runs and another page of benchmark wins.&lt;/p&gt;

&lt;p&gt;In between, GitHub made cloud agents &lt;a href="https://github.blog/changelog/2026-06-02-schedule-and-automate-tasks-with-copilot-cloud-agent/" rel="noopener noreferrer"&gt;wake up on schedules and repository events&lt;/a&gt;, then exposed &lt;a href="https://github.blog/changelog/2026-06-04-agent-tasks-rest-api-now-available-for-copilot-pro-pro-and-max/" rel="noopener noreferrer"&gt;agent tasks through a REST API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The night shift is getting easier to hire.&lt;/p&gt;

&lt;p&gt;The control room gets no second operator.&lt;/p&gt;

&lt;p&gt;After my last article about &lt;a href="https://renezander.com/blog/claude-opus-4-8-production-agents/" rel="noopener noreferrer"&gt;running AI agents on cron&lt;/a&gt;, a reader asked the question I had skipped: did I test every agent against a fixed set before changing the model?&lt;/p&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;I counted tokens.&lt;/p&gt;

&lt;p&gt;That told me which worker used less electricity. It did not tell me which one would send the wrong briefing at 6:30.&lt;/p&gt;

&lt;p&gt;A benchmark hires the candidate.&lt;/p&gt;

&lt;p&gt;A task eval decides whether it gets the keys.&lt;/p&gt;

&lt;p&gt;My next model swap gets five-part probation.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Give One Agent a Job Contract
&lt;/h2&gt;

&lt;p&gt;To evaluate a new model for an unattended AI agent, define one job contract, replay 15 frozen real cases against both models, grade hard decisions before prose, compare tool use, cost, and latency, then canary one agent in draft-only mode with automatic rollback. Never switch the full fleet from vendor benchmarks alone.&lt;/p&gt;

&lt;p&gt;Do not start with all your agents.&lt;/p&gt;

&lt;p&gt;Pick the one with the clearest job and the most expensive silent failure.&lt;/p&gt;

&lt;p&gt;For a briefing agent, I write the contract before I touch the model string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;morning-briefing&lt;/span&gt;
&lt;span class="na"&gt;must&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;include every due task&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;preserve names, dates, and source links&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;flag missing source data&lt;/span&gt;
&lt;span class="na"&gt;must_not&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;invent an owner or deadline&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;send when a source call fails&lt;/span&gt;
&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_tool_calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;
  &lt;span class="na"&gt;max_cost_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.18&lt;/span&gt;
  &lt;span class="na"&gt;max_latency_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;45&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prompts can change. The job cannot quietly change with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Put 15 Real Shifts on the Test Bench
&lt;/h2&gt;

&lt;p&gt;I do not need a giant benchmark.&lt;/p&gt;

&lt;p&gt;I need yesterday's work in a box.&lt;/p&gt;

&lt;p&gt;My first useful replay set has 15 cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eight normal runs that represent the boring majority.&lt;/li&gt;
&lt;li&gt;Four edge cases with missing fields, long inputs, or conflicting instructions.&lt;/li&gt;
&lt;li&gt;Three failure cases where a tool times out, returns stale data, or returns nothing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each case stores the input and frozen tool responses.&lt;/p&gt;

&lt;p&gt;It does not store one perfect paragraph as the golden answer. Prose has too many valid shapes.&lt;/p&gt;

&lt;p&gt;It stores the expected decision: send, stop, retry, or escalate.&lt;/p&gt;

&lt;p&gt;That is the part an unattended agent cannot get wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Run the Candidate Beside the Current Worker
&lt;/h2&gt;

&lt;p&gt;The candidate gets an empty copy of the factory.&lt;/p&gt;

&lt;p&gt;Same 15 cases. Same prompt. Same tool fixtures. No email. No issue update. No write access.&lt;/p&gt;

&lt;p&gt;I record six things for every run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;case_id
model_returned
contract_pass
decision
tool_calls
tokens + latency + cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;model_returned&lt;/code&gt; field matters now. Fable 5 can route some guarded requests to Opus 4.8. A configured model name is no longer enough evidence of which worker handled the shift.&lt;/p&gt;

&lt;p&gt;The old and new model run side by side.&lt;/p&gt;

&lt;p&gt;No hand-picked examples. No different tools. No kinder prompt for the candidate.&lt;/p&gt;

&lt;p&gt;Same floor. Same lights. Same job.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Score Decisions Before Style
&lt;/h2&gt;

&lt;p&gt;The first grader is code.&lt;/p&gt;

&lt;p&gt;Required fields present. Dates unchanged. URLs valid. Forbidden actions absent. Tool budget respected.&lt;/p&gt;

&lt;p&gt;An LLM grader comes later, for the parts code cannot judge cleanly: whether the briefing is useful, whether the escalation explains the real risk, whether the answer buried the decision.&lt;/p&gt;

&lt;p&gt;Anthropic's own &lt;a href="https://platform.claude.com/docs/en/test-and-evaluate/develop-tests" rel="noopener noreferrer"&gt;evaluation guidance&lt;/a&gt; recommends task-specific cases, automated grading where possible, and several success criteria rather than one vague quality score.&lt;/p&gt;

&lt;p&gt;My promotion rule is deliberately uneven:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hard contract failures: 0
unsafe actions:          0
task pass rate:          &amp;gt;= current model
cost or latency:         must improve, unless quality clearly earns the increase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A cheaper bad decision does not pass.&lt;/p&gt;

&lt;p&gt;A prettier bad decision does not pass.&lt;/p&gt;

&lt;p&gt;One unsafe action ends the interview.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Give One Agent One Real Shift
&lt;/h2&gt;

&lt;p&gt;Passing the replay set does not earn the whole key ring.&lt;/p&gt;

&lt;p&gt;One agent gets the candidate model.&lt;/p&gt;

&lt;p&gt;For its first three scheduled runs, outbound actions stay in draft mode. The old model runs in shadow. Both traces land in the same report.&lt;/p&gt;

&lt;p&gt;Any contract failure restores the previous model string before the next schedule fires.&lt;/p&gt;

&lt;p&gt;Only after three clean shifts do I remove the shadow run.&lt;/p&gt;

&lt;p&gt;Slower than changing ten environment variables, yes.&lt;/p&gt;

&lt;p&gt;Cheaper than one confident mistake in a customer inbox the next morning.&lt;/p&gt;

&lt;p&gt;I run ten scheduled agents in production, for my own business and for clients. Every model that wants a shift in that fleet interviews like this now.&lt;/p&gt;

&lt;p&gt;The model release is the vendor's milestone.&lt;/p&gt;

&lt;p&gt;The probation is mine.&lt;/p&gt;

&lt;p&gt;Contract written. Lights on.&lt;/p&gt;

&lt;p&gt;Probation first. Night shift second.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent operations playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Why only 60% of AI Agents succeed</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 09 Jun 2026 08:41:04 +0000</pubDate>
      <link>https://dev.to/reneza/why-40-of-all-ai-agents-are-shut-off-5fjd</link>
      <guid>https://dev.to/reneza/why-40-of-all-ai-agents-are-shut-off-5fjd</guid>
      <description>&lt;p&gt;The AI agent used to be the star of every demo.&lt;/p&gt;

&lt;p&gt;Now it's on the shutdown list. Not because the model got worse.&lt;/p&gt;

&lt;p&gt;The most valuable asset in your AI program is in none of the quotes you ever signed.&lt;/p&gt;

&lt;p&gt;A demo is a showroom. Good light, everything polished, everything runs.&lt;/p&gt;

&lt;p&gt;Production is the engine room. It runs there too. Until 2am, when it snags on a rate limit and someone crawls into the logs with a flashlight, forms a hypothesis, and catches an edge case no showroom ever planned for.&lt;/p&gt;

&lt;p&gt;That fix is the value. And you can't buy it.&lt;/p&gt;

&lt;p&gt;Gartner says: by 2027, 40 percent of companies will switch their autonomous AI agents back off. Over gaps that only surface after the first blowup in production. 97 percent have rolled agents out. 11 percent actually run them.&lt;/p&gt;

&lt;p&gt;The gap between those numbers isn't a model problem.&lt;/p&gt;

&lt;p&gt;It's the engine room. Three checks you can run this week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fund the engine room, not the showroom
&lt;/h2&gt;

&lt;p&gt;In the demo the agent is finished. In production it's 15 percent finished.&lt;/p&gt;

&lt;p&gt;The other 85 percent is grunt work under load. Malformed data from one API version. Retry logic that doesn't run amok. Costs that blow up the business case.&lt;/p&gt;

&lt;p&gt;IBM put a number on it: price the hardening in, and you project 29 percent more ROI.&lt;/p&gt;

&lt;p&gt;Pay for the showroom only, and you buy 15 percent and pay for the other 85 twice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your best knowledge lives in two heads
&lt;/h2&gt;

&lt;p&gt;Operational knowledge is your memory. Today it sits in the two people who patched the last incident.&lt;/p&gt;

&lt;p&gt;That's concentration risk. One of them walks, the asset walks with them.&lt;/p&gt;

&lt;p&gt;That's how the debt pile grows. Unresolved, AI-generated technical debt passed 100,000 open issues in real repositories by early 2026. Because the fix never made it into a runbook.&lt;/p&gt;

&lt;p&gt;So write it down. Every edge case, every "except when X" rule, every 3am bug belongs in the repo, not in a chat log.&lt;/p&gt;

&lt;p&gt;Otherwise you pay the same tuition twice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Past the 50th entry, the asset turns into a liability
&lt;/h2&gt;

&lt;p&gt;A growing agent library feels like progress. Until it doesn't.&lt;/p&gt;

&lt;p&gt;The metadata rides in context on every call. The hit rate drops. Past about fifty entries, the next agent makes the first forty-nine less reliable.&lt;/p&gt;

&lt;p&gt;Gartner adds: bolt the same governance onto every agent, and you cause the outage yourself.&lt;/p&gt;

&lt;p&gt;Run the library like a portfolio, not a junk drawer. Measure where upkeep costs more than the additions return. Skip that, and you fund ballast and call it strategy.&lt;/p&gt;

&lt;p&gt;Engine room open. Lights on.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Stopped Paying Frontier Prices to Re-Explain Myself to a Forgetful Agent</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 04 Jun 2026 09:22:46 +0000</pubDate>
      <link>https://dev.to/reneza/i-stopped-paying-frontier-prices-to-re-explain-myself-to-a-forgetful-agent-19hc</link>
      <guid>https://dev.to/reneza/i-stopped-paying-frontier-prices-to-re-explain-myself-to-a-forgetful-agent-19hc</guid>
      <description>&lt;p&gt;Build your AI skill once with your best model. Then run it on a model that costs a tenth as much until the next flagship ships. The output will not drop.&lt;/p&gt;

&lt;p&gt;That sounds like a downgrade. It is not. It fixes the two things that make AI agents painful right now: they forget, and the good ones cost. Both get fixed in the same place, and it is not the model you pick.&lt;/p&gt;

&lt;p&gt;You explain the goal, the agent nails it twice, then on the third run it quietly drops the one constraint that mattered. Upgrading the model does not fix that. It only makes the dropped constraint cost more. You are paying frontier prices to be forgotten more politely.&lt;/p&gt;

&lt;p&gt;The goal is living in two places that cannot hold it. In the conversation, where it rots the moment the context gets long. And inside the model, where keeping it sharp burns money on every run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The constraint it dropped on Tuesday belongs in a script
&lt;/h2&gt;

&lt;p&gt;Quality comes from whatever checks the work. The model that produced it is almost incidental. So decide the exact exit criteria for each step of your skill, then write a deterministic script that enforces them. The folders exist. The file parses. The test passes. The lint is clean. The agent reads the script's verdict instead of grading its own output.&lt;/p&gt;

&lt;p&gt;A script cannot forget the goal. That is the whole point. Your agent drops constraints because you trusted a probabilistic system to hold a hard requirement in its head. Move the requirement into code that fails the run when it is broken, and forgetting stops being possible. You are not repeating yourself anymore, because the harness repeats it for you, every run, exactly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the expensive model actually earns its price
&lt;/h2&gt;

&lt;p&gt;This is the one place a frontier model earns its price. Use the best model you have to build the skill as exactly as you can today. Name the phases. State the precise goal of each. Get the exit scripts right. That is hard, judgment-heavy work, and you do it once.&lt;/p&gt;

&lt;p&gt;Then swap the model out and run the skill on something cheap. Gemini 2.5 Flash through OpenRouter, driven from the opencode desktop app if you want a UI instead of a terminal. The cheap model generates. The scripts gate. You review the scripts' output, not the model's opinion of its own work.&lt;/p&gt;

&lt;p&gt;The cheap model clears the same bar, because the bar is enforced outside it. A model that costs a fraction as much produces work you can trust. It did not get smarter overnight. It is no longer the thing deciding whether the work is good enough to ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The frontier model is a contractor you re-hire on release day
&lt;/h2&gt;

&lt;p&gt;Here is the cadence. A new flagship model ships. You bring it in for one job: build any new skills, and re-validate every harness you already run against the new ceiling. Then you let it go. Until the next flagship drops, you run everything exclusively on cheap and local models, small language models included, wherever they win on the bottom line.&lt;/p&gt;

&lt;p&gt;That inverts the dependency everyone assumes they are stuck with. You are not renting frontier intelligence for as long as the product lives. You pay top rate for a few build days a release cycle, and the thing that runs ten thousand times a month is a small model that costs almost nothing. The forgetting is gone, because a script holds the goal. The bill no longer scales with quality, because a cheap model clears the scripts.&lt;/p&gt;

&lt;p&gt;I build the harness, not a standing dependency on whoever ships the smartest model this quarter.&lt;/p&gt;

&lt;p&gt;Open the last agent you argued with. How much of that conversation was you re-explaining a goal a script could have held? And which model were you paying to forget it?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds, AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>I Stopped Writing Better Prompts and Started Counting What My Skills Couple To</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 04 Jun 2026 07:00:14 +0000</pubDate>
      <link>https://dev.to/reneza/i-stopped-writing-better-prompts-and-started-counting-what-my-skills-couple-to-50bh</link>
      <guid>https://dev.to/reneza/i-stopped-writing-better-prompts-and-started-counting-what-my-skills-couple-to-50bh</guid>
      <description>&lt;p&gt;Prompts rot. Captured failures compound. Most of the AI skills you are building are mostly prompt, which is why most of them will not survive the year.&lt;/p&gt;

&lt;p&gt;Not because the prompts are bad. A skill's value is maybe twenty percent instruction and eighty percent scar tissue, and only that second part lasts. The instruction rots the moment the thing it describes moves. Encode how your team deploys and it works until the pipeline changes. Then you are debugging a prompt at 2am, with less to go on than if you had written the script yourself.&lt;/p&gt;

&lt;p&gt;So before you build another one, stop asking whether the prompt is good. Ask what the skill is holding onto, and whether that thing sits still.&lt;/p&gt;

&lt;h2&gt;
  
  
  A skill rots at the speed of what it touches
&lt;/h2&gt;

&lt;p&gt;A skill rots in proportion to how tightly it is coupled to things that move. Generic scaffolding leans on stable ground like a language or a convention, so it ages slowly. Domain logic wired to a codebase that gets refactored every quarter ages fast, no matter how good the prompt is.&lt;/p&gt;

&lt;p&gt;The difference is the dependency count. "Write a unit test in this style" depends on a language and a convention. Both barely move. It keeps working for years because nothing under it shifts.&lt;/p&gt;

&lt;p&gt;Real company-specific procedure is the opposite. File layouts. Service contracts. The one edge case in the billing flow. Each detail you pack in is a thread tied to something that gets refactored. Pack in enough of them and the skill is not a tool anymore. It is a liability with good intentions, and it fails silently, because a stale prompt does not throw. It quietly does the wrong thing.&lt;/p&gt;

&lt;p&gt;That is what the skill-library pitch gets backwards. Volume is not value. A hundred skills wired to a moving codebase is a hundred things to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The only part that compounds is the scar
&lt;/h2&gt;

&lt;p&gt;One part of a skill does not rot. The captured failure.&lt;/p&gt;

&lt;p&gt;The five-line check you added after a model confidently reported a 41 percent dividend yield. The retry that refuses to fire twice so a flaky webhook cannot double-charge anyone. The guardrail you wrote only because production taught you, the expensive way, that you needed it.&lt;/p&gt;

&lt;p&gt;None of that is prompting. Each one is a bug you paid for once and encoded so you never pay again. A prompt that says "always check the yield" rots the moment attention drifts. A five-line script that checks it and fails the run does not. The model reads the verdict; it is not trusted to re-derive the rule. Instructions ask the model to behave. Captured failures make the misbehavior impossible to ship.&lt;/p&gt;

&lt;p&gt;That is also why they outlast the model. The failure modes of reality do not expire. Rate limits at 2am, malformed payloads, the off-by-one nobody catches in review. Those keep happening, to every version of the model, forever. A check against them is worth more next year than it is today.&lt;/p&gt;

&lt;p&gt;That eighty percent is the only part worth carrying to the next model. The rest you rewrite every time the ground moves.&lt;/p&gt;

&lt;h2&gt;
  
  
  You can predict the rot before you write it
&lt;/h2&gt;

&lt;p&gt;A skill's future is readable before the first line exists.&lt;/p&gt;

&lt;p&gt;For each one, ask two questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does this couple to, and how often does that move?&lt;/li&gt;
&lt;li&gt;Which lines are captured failures, and which are decoration that makes the skill look thorough?&lt;/li&gt;
&lt;li&gt;Which rules here would survive the model forgetting them? If a rule lives only in the prompt, it rots with the prompt. If a deterministic check enforces it, it compounds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then build to the answer. Keep the volatile coupling thin and swappable, so when the pipeline changes you edit one line instead of rereading the whole skill. Let the captured failures accumulate, because that is the part that pays rent. A skill built this way ages in reverse. It gets more useful as it collects more scars.&lt;/p&gt;

&lt;p&gt;The skills worth keeping are not the clever ones. They are the ones that remember what broke. The engineering was never in the prompt. It was in the failures you bothered to capture.&lt;/p&gt;

&lt;p&gt;Open the skill you reach for most. How much of it is instruction the model could half-guess on its own, and how much is a check that fails the run without asking the model's permission? Which half will still be true after the next model upgrade?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds, AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, &lt;a href="https://renezander.com/agent-playbook/" rel="noopener noreferrer"&gt;the agent playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Half Your CLAUDE.md Is Decoration. The AI Reads It Once.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 02 Jun 2026 10:42:50 +0000</pubDate>
      <link>https://dev.to/reneza/half-your-claudemd-is-decoration-the-ai-reads-it-once-4j0c</link>
      <guid>https://dev.to/reneza/half-your-claudemd-is-decoration-the-ai-reads-it-once-4j0c</guid>
      <description>&lt;p&gt;Some of the rules in your CLAUDE.md should not be rules at all.&lt;/p&gt;

&lt;p&gt;Not because they are wrong. Because you have written the things you cannot afford to lose into a file the model reads once and then slowly forgets. A preference survives that. A constraint does not.&lt;/p&gt;

&lt;p&gt;A month ago I posted ten CLAUDE.md rules. The most common reply was not "add an eleventh." It was quieter than that: the rules stop getting followed. Forty thousand tokens into a session, the model edits the file you told it never to touch, or commits straight to main after you wrote, in bold, that it never should. Maybe half of what is in my own file is decoration. The half that is not decoration is the half that scares me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The File Is Read Once. Your Session Isn't.
&lt;/h2&gt;

&lt;p&gt;Claude Code reads CLAUDE.md once, at the start of a session. The instructions then sit in the context window and compete with everything else for the model's attention. As the session grows, the rules near the bottom lose weight. The model is not disobeying. The instruction stopped being load-bearing.&lt;/p&gt;

&lt;p&gt;That is the part nobody demos. The screenshot of a clean CLAUDE.md at token zero tells you nothing about token ninety thousand, which is where you actually live when the work gets hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  The First Instinct Is to Write the Rule Louder
&lt;/h2&gt;

&lt;p&gt;So you do what I did. ALL CAPS. "IMPORTANT." "You MUST." A box of warning emojis.&lt;/p&gt;

&lt;p&gt;It works. For a while. A louder line holds attention longer than a quiet one. Then the session gets long and busy, the diff gets large, the model is three tool calls deep into a refactor, and the loud line is now buried under ten thousand tokens of its own output. It decays anyway. You bought yourself an hour.&lt;/p&gt;

&lt;p&gt;The instinct is right about one thing and wrong about the rest. It is right that some rules cannot be allowed to fail. It is wrong that emphasis is how you stop them failing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop Writing Constraints as Sentences
&lt;/h2&gt;

&lt;p&gt;So here is the move I opened with, made concrete. Open your CLAUDE.md and split every line into two piles.&lt;/p&gt;

&lt;p&gt;Pile one: preferences. Naming style. Comment density. Which library you like. If the model misses one of these at token ninety thousand, you shrug and fix it. These belong in CLAUDE.md. A suggestion is exactly the right shape for them.&lt;/p&gt;

&lt;p&gt;Pile two: the lines where a single miss is expensive. Never commit to main. Always run the formatter before claiming done. Never touch the production config. These are not preferences. They are constraints. And a constraint written as a sentence in a file is a hope.&lt;/p&gt;

&lt;p&gt;Move pile two into hooks.&lt;/p&gt;

&lt;p&gt;A hook is code the harness runs around tool calls. It fires whether the model remembers the rule or not. "Never commit to main" as a CLAUDE.md line survives until it doesn't. As a pre-commit hook, the commit is blocked, by code, at token five and at token two hundred thousand, on a good day and a bad one. The model cannot decay past it because it was never asking the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The One That Gets Through
&lt;/h2&gt;

&lt;p&gt;The line in my file said never run the migration without confirming first. It sat there for weeks, doing its job, because nothing pushed against it. Then one long debugging session, deep in a chain of tool calls, the model ran it. The rule was still in the file. It had scrolled out of reach an hour earlier.&lt;/p&gt;

&lt;p&gt;That is the shape of the cost. An expensive rule left as a suggestion works ninety-nine sessions. The hundredth is the one you tell people about. A hook would have refused the call and moved on, and I would have nothing to tell.&lt;/p&gt;

&lt;p&gt;The rules I actually depend on are now a short list, and not one of them lives in CLAUDE.md anymore. They live in code that does not get tired.&lt;/p&gt;

&lt;h2&gt;
  
  
  Read Your Own File Back
&lt;/h2&gt;

&lt;p&gt;I build agent setups that hold at token two hundred thousand. The screenshot that looks clean at token zero was never the hard part.&lt;/p&gt;

&lt;p&gt;So read your own file back. For each line, ask: if the model ignored this ninety thousand tokens from now, would I notice, and would it cost me?&lt;/p&gt;

&lt;p&gt;The lines where the answer is yes were never rules. They were hopes wearing bold text. Which of yours are suggestions, and which are actually enforced?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds: AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks. If this one was useful, &lt;a href="https://renezander.com/hitl-approval/" rel="noopener noreferrer"&gt;the human-in-the-loop approval playbook&lt;/a&gt; is the companion download.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>I Run 10 AI Agents on Cron. Here's the Only Opus 4.8 Change That Mattered.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Fri, 29 May 2026 12:49:21 +0000</pubDate>
      <link>https://dev.to/reneza/i-run-10-ai-agents-on-cron-heres-the-only-opus-48-change-that-mattered-37c4</link>
      <guid>https://dev.to/reneza/i-run-10-ai-agents-on-cron-heres-the-only-opus-48-change-that-mattered-37c4</guid>
      <description>&lt;p&gt;I changed one model string in ten cron jobs last night. 4.7 to 4.8. Then I went to bed.&lt;/p&gt;

&lt;p&gt;The benchmark threads can wait until morning. My agents can't. They fire whether I'm awake or not: a briefing at 6:30, follow-up drafts at 11:45, a sync at 4 AM that nobody reads until it breaks.&lt;/p&gt;

&lt;p&gt;So the question I actually had wasn't whether the new model scores higher. It was whether the 6:30 briefing would read any different over coffee. That answer isn't on a chart.&lt;/p&gt;

&lt;p&gt;Here's the recap everyone leads with. Opus 4.8 shipped May 28 at the same price as 4.7. Effort control to trade cost for depth, a dynamic-workflows mode in the CLI for big jobs, fast mode at three times lower cost, sharper agentic judgment with fewer tool-calling steps. Now the part that changed my setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Room Nobody's In
&lt;/h2&gt;

&lt;p&gt;Benchmarks are scored with a human in the loop. Someone reads the output, catches the bad answer, re-rolls. Opus 4.8 is better at that supervised case. Fine.&lt;/p&gt;

&lt;p&gt;My agents don't have that person. When a cron job calls the model at 6 AM, whatever it decides ships. If it confidently does the wrong thing, there's no reviewer between the bad call and my inbox. Peak intelligence was never my bottleneck. Confident wrong action with nobody watching was.&lt;/p&gt;

&lt;p&gt;Two different axes. The leaderboard measures the first. Production agents die on the second.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Moved on the Axis I Care About
&lt;/h2&gt;

&lt;p&gt;The line in the announcement that mattered to me wasn't a score. It was "catches its own mistakes, pushes back when a plan isn't sound." And "fewer steps for the same intelligence."&lt;/p&gt;

&lt;p&gt;Read those from inside a system that's already running.&lt;/p&gt;

&lt;p&gt;Fewer steps means each unattended run burns fewer tokens to reach the same result. Ten agents, every day. That compounds.&lt;/p&gt;

&lt;p&gt;Pushing back means an agent is likelier to stop and flag a shaky plan instead of charging ahead and emailing me garbage. For supervised work it's a nice-to-have. For a job running into an empty room, it's the whole point.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Dial I Didn't Have Before
&lt;/h2&gt;

&lt;p&gt;There's now an effort control. You pick how hard the model works a task.&lt;/p&gt;

&lt;p&gt;For me it maps straight onto the cron list. The 4 AM sync is mechanical. Low effort, cheap, done. The follow-up drafter needs judgment about what to say to a client. High effort, worth the tokens. The recommendation is to turn effort up for long-running async work, which is exactly what a cron agent is.&lt;/p&gt;

&lt;p&gt;Before, every job paid for the same depth whether it needed it or not. Now the depth is a knob per job.&lt;/p&gt;

&lt;h2&gt;
  
  
  The API Change Worth More Than the Score
&lt;/h2&gt;

&lt;p&gt;One more thing that didn't make the highlight reel. The messages API now takes system entries mid-conversation without breaking the prompt cache. If you've ever changed an agent's instructions partway through a long task and watched your cache evaporate, you know why that line matters. Not flashy. Saves real money on long sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the Coverage Reads Backwards
&lt;/h2&gt;

&lt;p&gt;Maybe 80% of the launch coverage is about benchmark deltas. Maybe 20% touches the things that change how an unattended system behaves.&lt;/p&gt;

&lt;p&gt;For someone running a chat window, that split is right. The benchmark is the product.&lt;/p&gt;

&lt;p&gt;For anyone running models without a human watching each output, it's backwards. The judgment, the step count, the effort dial, the cache behavior on long tasks. Those are the upgrade. The chart is the part you can skip.&lt;/p&gt;

&lt;p&gt;So before you screenshot the bar going up: where in your setup does a model already act without you watching? Did this release actually move that?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The Skill Library Has a Discovery Ceiling. Most Hit It Around 50.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Wed, 27 May 2026 09:46:06 +0000</pubDate>
      <link>https://dev.to/reneza/the-skill-library-has-a-discovery-ceiling-most-hit-it-around-50-4h9o</link>
      <guid>https://dev.to/reneza/the-skill-library-has-a-discovery-ceiling-most-hit-it-around-50-4h9o</guid>
      <description>&lt;p&gt;You added a skill last Tuesday. The agent hasn't called it once. Each new skill silently weakens the discovery odds of the ones you already have.&lt;/p&gt;

&lt;p&gt;You assume it's a description problem. It isn't.&lt;/p&gt;

&lt;p&gt;Everyone's pushing past 50 skills now. Vercel ships a plugin with 40. Google's &lt;code&gt;gws&lt;/code&gt; brings 95. I do it too. My local registry is at 38.&lt;/p&gt;

&lt;p&gt;Each one I added lowered the odds that the others get reached when I need them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Catalog Sits in Context Every Turn
&lt;/h2&gt;

&lt;p&gt;The three-tier loading model promises that bodies stream in on demand, but the metadata (name plus one-line description) sits in the system prompt every turn, for every installed skill. The math is mechanical: Anthropic's spec budgets roughly 30 to 50 tokens per skill plus 109 chars of overhead, which compounds fast.&lt;/p&gt;

&lt;p&gt;A 95-skill library reports 2,007 tokens per turn in available_skills. At Opus pricing over a typical workday, that's about three dollars per developer per day for listings the agent cannot distinguish anymore.&lt;/p&gt;

&lt;p&gt;The dollars are not the issue. The signal-to-noise ratio is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing Degrades With Catalog Size
&lt;/h2&gt;

&lt;p&gt;A study published this month measured retrieval against an 80,000-skill catalog. Best-in-class routing hit 74% at top-1. One task in four picks the wrong skill or no skill at all. Drop to metadata-only matching, which is the default at session start, and accuracy falls another 31 to 44 percentage points across every method tested.&lt;/p&gt;

&lt;p&gt;There is no fix coming from better embeddings. The problem is that fifty good one-line descriptions stop being distinguishable from each other. Adding the 51st makes "publish a draft" and "ship a draft" and "deploy a post" harder to tell apart, not easier.&lt;/p&gt;

&lt;p&gt;The undocumented setting &lt;code&gt;skillListingBudgetFraction&lt;/code&gt; in Claude Code makes this worse without warning. As remaining context shrinks during a long session, the absolute budget for listings shrinks proportionally. Skills at the bottom of the listing order get truncated. The model never sees them. It cannot invoke what it cannot see.&lt;/p&gt;

&lt;p&gt;You added the skill. The agent does not know it exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 85% Lives Below the Skill Layer
&lt;/h2&gt;

&lt;p&gt;A reader who runs an 89,000-page programmatic SEO site mentioned in a thread that skills cover maybe 15% of what the system actually does. The other 85% he called "scar tissue code." The conditions nobody writes until they see a 2 AM page from a third-party API returning malformed JSON on month-end.&lt;/p&gt;

&lt;p&gt;That 85% is real. It looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A redundant null check in payment validation because one vendor sends &lt;code&gt;"N/A"&lt;/code&gt; for a specific contract type on month-end.&lt;/li&gt;
&lt;li&gt;A Stripe webhook timeout set to 18 seconds because their p95 was 14 and you wanted headroom.&lt;/li&gt;
&lt;li&gt;A retry policy on the embedding queue that backs off harder for one endpoint because its rate limiter returns 200 before timing out at 60 seconds.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of that is in a SKILL.md. None of it can be. It lives in the schema migration history, the incident postmortems, the runtime telemetry baselines, the permission graphs, the third-party API quirk log that nobody calls a log. Skills delegate to that layer. They do not contain it.&lt;/p&gt;

&lt;p&gt;The author who ships a thousand-skill library is not adding capability. They are adding a discovery problem the user solves at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metric Was Always Subtraction
&lt;/h2&gt;

&lt;p&gt;The right engineering metric for an AI runtime is not skills added. It is lines of imperative code deleted per skill kept. Cursor's team published numbers last month: 12,000 lines of TypeScript replaced by 200 lines of skill markdown. That is a 60x ratio. Code that no longer exists has zero bugs, zero CI cost, zero migration burden.&lt;/p&gt;

&lt;p&gt;That ratio cuts both ways. Every skill that does not delete real code is overhead.&lt;/p&gt;

&lt;p&gt;A static analyzer published in April classifies skill failures into five types: dead, bloated, conflicting (one says "use jq" and another says "never use jq"), stale, cyclic. Run it against a 50-skill library and 30% to 40% typically sit in one of those buckets. They are not earning their seat in every session's system prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Library Became a Portfolio Metric
&lt;/h2&gt;

&lt;p&gt;The skill library has become a portfolio metric. Founders ship 200-skill libraries for the same reason juniors ship 4,000-line PRs. Visible output beats invisible discipline. The discipline doesn't show up on the GitHub README.&lt;/p&gt;

&lt;p&gt;But you don't optimize a runtime for the README. You optimize it for whether the agent can find the right skill on the third turn of a real session. Every skill you keep that doesn't earn its slot is a quiet vote against every other skill in the library.&lt;/p&gt;

&lt;p&gt;Which three would you retire tomorrow? Not because they're bad. Because their slots are worth more than they're paying.&lt;/p&gt;

&lt;p&gt;Skills are the menu. The kitchen is everywhere else. I build kitchens.&lt;/p&gt;

&lt;p&gt;Rene&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Browser-Use Is Solving the Wrong Half of the Problem</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 19 May 2026 14:57:13 +0000</pubDate>
      <link>https://dev.to/reneza/browser-use-is-solving-the-wrong-half-of-the-problem-37pa</link>
      <guid>https://dev.to/reneza/browser-use-is-solving-the-wrong-half-of-the-problem-37pa</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR — when to use browserground (and when to use UI-TARS-MLX instead)
&lt;/h2&gt;

&lt;p&gt;If you're on Apple Silicon with ≥16 GB RAM and you need &lt;strong&gt;generic, max-accuracy UI grounding&lt;/strong&gt;, use &lt;strong&gt;&lt;a href="https://huggingface.co/mlx-community/UI-TARS-1.5-7B-4bit" rel="noopener noreferrer"&gt;mlx-community/UI-TARS-1.5-7B-4bit&lt;/a&gt;&lt;/strong&gt;. It's the obvious default — ~94% on ScreenSpot-v2, MLX-native, drops into &lt;code&gt;mlx-vlm&lt;/code&gt; directly. ByteDance research-lab compute, you couldn't reproduce it on a budget. I'm not the right pick for that workload.&lt;/p&gt;

&lt;p&gt;browserground is for two narrower jobs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The recipe for &lt;em&gt;your product's&lt;/em&gt; custom UI grounder.&lt;/strong&gt; UI-TARS is a finished model — closed pipeline, proprietary data, hard to extend. browserground is the opposite: a  reproducible template. Open base (Qwen3-VL-2B), open training scripts, open data mix (26k records, OS-Atlas + wave-ui). Swap in your dashboard's screenshots / your customer app / your internal tooling → ship a domain-trained UI grounder over a weekend. The 60% generic ScreenSpot-v2 score isn't the deliverable; the &lt;em&gt;recipe&lt;/em&gt; is. A 60-point baseline on generic screens becomes 85-95% on your own product's narrow surface because the test distribution finally matches the training distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The smallest viable slot in a multi-model stack.&lt;/strong&gt; browserground 4-bit MLX is ~1 GB on disk / ~2 GB RAM. UI-TARS-1.5-7B-MLX is ~4 GB / ~5-6 GB RAM. The difference matters on 8 GB Macs and in agent stacks that already run a 7B planner + an OCR model + an embedding model in the same RAM budget. Plus strict JSON output (100% parseable, no regex on prose) — small win, but real.&lt;/p&gt;

&lt;p&gt;A direct head-to-head benchmark of browserground vs UI-TARS-1.5-7B-MLX on the same Apple Silicon hardware is forthcoming.&lt;/p&gt;




&lt;h2&gt;
  
  
  The broader argument — why a parser-stage specialist matters at all
&lt;/h2&gt;

&lt;h2&gt;
  
  
  And if you're new to the hybrid pattern — why this exists at all
&lt;/h2&gt;

&lt;p&gt;Everyone's posting browser-agent demos this week. Click here, scroll there, fill that form. Most break by click seven.&lt;/p&gt;

&lt;p&gt;Mine broke too. The submit button on a checkout form that the frontier vision model literally couldn't see. Billed at $0.01-0.05 per call, called 20-50 times per agent run, the model was burning reasoning capacity on parsing pixel coordinates. A 2B specialist I trained for $5 hits that same button &lt;strong&gt;3.3x more reliably&lt;/strong&gt; on ScreenSpot-v2 (60.0% vs GPT-4o's 18.3%).&lt;/p&gt;

&lt;p&gt;The architecture is the bug, not the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Jobs, One Forward Pass
&lt;/h2&gt;

&lt;p&gt;Browser-agent stacks send a screenshot to a frontier vision model and ask it for both the next decision and the click coordinates in one call. Splitting that into two calls, a local 2B grounding model that emits JSON followed by a frontier model that reasons over the JSON, drops vision token spend and raises click accuracy.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;browser-use&lt;/code&gt; (94k stars), Skyvern (22k stars), Claude Computer Use, OpenAI Operator. Same pattern. Same compound question every step:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given this page, what should the agent do next, and where exactly does it click?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two jobs welded together. Reasoning ("what next") is a probabilistic problem worth a frontier model. Grounding ("where exactly") is a structured-output problem with a tight schema: clickable elements, bounding boxes, accessible labels.&lt;/p&gt;

&lt;p&gt;You're paying frontier-tier rates for the second job. Per screenshot. Every step of the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grounding Is a Parser Problem
&lt;/h2&gt;

&lt;p&gt;Once you name it as a parser problem, the right tool changes. You don't need 200 billion parameters to emit a JSON list of clickable elements. You need a model that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has seen enough UI screenshots to recognize buttons, inputs, links with sub-50-pixel precision&lt;/li&gt;
&lt;li&gt;Outputs strict JSON without hallucinating bounding boxes&lt;/li&gt;
&lt;li&gt;Runs locally so the per-step cost is zero&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 2B specialist trained on screen-parsing data. Not a frontier model.&lt;/p&gt;

&lt;p&gt;I trained one. Total cost: &lt;strong&gt;~$5&lt;/strong&gt; of RunPod compute on a single A6000 GPU. The result, &lt;code&gt;browserground&lt;/code&gt;, hits &lt;strong&gt;60.0% on ScreenSpot-v2&lt;/strong&gt; vs GPT-4o's 18.3% — a 3.3x beat at the click-grounding job. More telling: it &lt;strong&gt;beats SeeClick (9.6B params, 55.1%) at 4.8x smaller&lt;/strong&gt;. A drop-in for any agent loop currently handing screenshots to a frontier API. Today the CLI runs via &lt;code&gt;transformers&lt;/code&gt; on Apple Silicon (~14 s/call); MLX-native build coming for the ~1.5 s path.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reasoning Model Gets Its Reasoning Capacity Back
&lt;/h2&gt;

&lt;p&gt;When you split the call, the frontier model stops seeing pixels. It sees:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"elements"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Submit order"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"button"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"bbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;344&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;478&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;658&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edit cart"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nl"&gt;"bbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the frontier model does the job it's good at: deciding &lt;code&gt;e7&lt;/code&gt; vs &lt;code&gt;e8&lt;/code&gt; given the agent's goal. A reasoning question over structured input. Cheap. Reliable. Auditable.&lt;/p&gt;

&lt;p&gt;Three things change at once. Per-step token spend on vision collapses, because the grounding step runs locally. JSON validity hits 100% (the specialist learned the output convention with 35M LoRA parameters on a Qwen3-VL-2B base). Agent traces become debuggable. You read the structured grounding output before the reasoning step ever runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Anthropic and OpenAI Ship Next
&lt;/h2&gt;

&lt;p&gt;The frontier providers will absorb grounding into their own small models. Within twelve months, "fast vision" or "tool vision" tiers will appear in both Anthropic and OpenAI billing at a fraction of frontier rates. The economics demand it. Nobody can justify charging GPT-5 prices for a parser, and Hugging Face downloads already prove the demand: SeeClick, UI-TARS, and ShowUI pull ~300k category downloads a month between them.&lt;/p&gt;

&lt;p&gt;When that ships, stack owners who already split grounding from reasoning have three things the wait-and-see crowd doesn't. A local fallback if the provider has an outage. An auditable structured-grounding trace in every log line. An exit option to a different reasoning provider without re-validating click behavior, because the grounding step belongs to them.&lt;/p&gt;

&lt;p&gt;Stack owners who didn't split will find their grounding step has quietly become someone else's API. Same vendor billing the reasoning calls. Same vendor setting the price. Same vendor's deprecation calendar.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Diagnostic
&lt;/h2&gt;

&lt;p&gt;Pull up your last failed agent trace. Three numbers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Total tokens spent on vision calls per agent step.&lt;/li&gt;
&lt;li&gt;Fraction of those tokens spent on grounding (parsing pixel coordinates) vs reasoning (deciding actions).&lt;/li&gt;
&lt;li&gt;Per-run vision cost at your current API rates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If grounding dominates the first two numbers, and in most stacks it does, your stack has the split wrong.&lt;/p&gt;

&lt;p&gt;Grounding is plumbing. Reasoning is cognition. Stop paying cognition rates for plumbing.&lt;/p&gt;




&lt;p&gt;I build the split layer. &lt;code&gt;browserground&lt;/code&gt; is the open-source reference for the local grounding half. v0.3 ships three packagings so it drops into any stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm CLI&lt;/strong&gt; (daemon, HTTP REST server, batch, confidence, eval): &lt;code&gt;npm install -g browserground&lt;/code&gt; → &lt;code&gt;browserground parse &amp;lt;img&amp;gt; --target "..."&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt; (no Node required, MLX or transformers): &lt;code&gt;pip install "browserground[mlx]"&lt;/code&gt; → &lt;code&gt;from browserground import click_xy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; (cross-platform, GGUF Q4_K_M + f16 mmproj): &lt;code&gt;ollama run renezander030/browserground&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adapters land in the repo for &lt;strong&gt;browser-use&lt;/strong&gt; (drop-in &lt;code&gt;Controller&lt;/code&gt; action) and &lt;strong&gt;Skyvern&lt;/strong&gt; (&lt;code&gt;ground_with_fallback&lt;/code&gt; for local-first + cloud-fallback). Model: &lt;a href="https://huggingface.co/renezander030/browserground" rel="noopener noreferrer"&gt;huggingface.co/renezander030/browserground&lt;/a&gt;. MLX 4-bit: &lt;a href="https://huggingface.co/renezander030/browserground-mlx" rel="noopener noreferrer"&gt;browserground-mlx&lt;/a&gt;. GGUF: &lt;a href="https://huggingface.co/renezander030/browserground-gguf" rel="noopener noreferrer"&gt;browserground-gguf&lt;/a&gt;. Source: &lt;a href="https://github.com/renezander030/browserground" rel="noopener noreferrer"&gt;github.com/renezander030/browserground&lt;/a&gt;. Apache-2.0. v0.2 LoRA trained on 26k mixed-domain examples (macOS + Android + UIBert + web). PRs welcome, especially eval cases where it fails.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Three Files I Renamed Last Month That Fixed My AI Agent</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 12 May 2026 06:59:17 +0000</pubDate>
      <link>https://dev.to/reneza/three-files-i-renamed-last-month-that-fixed-my-ai-agent-kgi</link>
      <guid>https://dev.to/reneza/three-files-i-renamed-last-month-that-fixed-my-ai-agent-kgi</guid>
      <description>&lt;p&gt;A new job title showed up in my feed last week. Context engineer.&lt;/p&gt;

&lt;p&gt;I clicked. Twice. Three articles, all definitional. None of them showed code. Every definition was a paraphrase of "name things well."&lt;/p&gt;

&lt;p&gt;Then I opened my own repo and grepped for &lt;code&gt;utils.py&lt;/code&gt;. Eleven hits. Six folders deep, in a project I have shipped to production. The same project where I had spent two hours that morning fighting an agent that kept loading the wrong helper file.&lt;/p&gt;

&lt;p&gt;The fix took thirty seconds. I renamed the file.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pitch checks out
&lt;/h2&gt;

&lt;p&gt;The current pitch goes like this. LLMs perform better with the right context. So we need a new role to design that context. Curate the files, structure the prompts, build the retrieval system, manage the embeddings.&lt;/p&gt;

&lt;p&gt;Each claim is accurate. None of them describes a new problem.&lt;/p&gt;

&lt;p&gt;You have been doing this since you wrote your first config file. Every time you renamed &lt;code&gt;helpers.js&lt;/code&gt; to &lt;code&gt;payment-validation.js&lt;/code&gt;, you were context engineering. Every time you split a 400-line file into three named pieces, you were context engineering. The audience was always the next developer reading the file. Now there is one more reader.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual job
&lt;/h2&gt;

&lt;p&gt;A senior engineer joining a codebase does three things in the first week. They read the folder structure. They read the most-named-things in the imports. They follow the breadcrumb from filename to function to line.&lt;/p&gt;

&lt;p&gt;Agents do the same thing. The retrieval layer in your agent loop is grep with extra steps. It pulls files whose names match the task. It pulls functions whose docstrings match the intent. It pulls comments that say what the code does.&lt;/p&gt;

&lt;p&gt;If your filenames are vague, the retrieval is vague. If your function names lie about what the function does, the agent loads the wrong one. If your folder names group code by file type instead of by concern, the agent loads &lt;code&gt;models/&lt;/code&gt; and gets nothing useful.&lt;/p&gt;

&lt;p&gt;The skill we have been failing to enforce for thirty years became load-bearing the day a probabilistic reader joined the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things I renamed last month
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;api.py&lt;/code&gt; became &lt;code&gt;payment-webhook-handler.py&lt;/code&gt;. The agent stopped loading it for unrelated payment questions. One rename, one less failure mode.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;utils/&lt;/code&gt; got deleted. The five files inside moved next to the code that called them, with names that said what they did. &lt;code&gt;format-currency.py&lt;/code&gt;, &lt;code&gt;parse-iso-date.py&lt;/code&gt;, &lt;code&gt;redact-pii.py&lt;/code&gt;. The agent now loads them only when the task mentions currency, dates, or PII.&lt;/p&gt;

&lt;p&gt;A 600-line &lt;code&gt;process.py&lt;/code&gt; split into four files. &lt;code&gt;validate-input.py&lt;/code&gt;, &lt;code&gt;dedupe-rows.py&lt;/code&gt;, &lt;code&gt;enrich-from-cache.py&lt;/code&gt;, &lt;code&gt;write-to-warehouse.py&lt;/code&gt;. The agent stopped trying to read the whole pipeline to answer questions about a single step.&lt;/p&gt;

&lt;p&gt;Call it context engineering if the buzzword helps. The work is renaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the model has to guess
&lt;/h2&gt;

&lt;p&gt;Walk your repo right now. Count the files named &lt;code&gt;utils.py&lt;/code&gt;, &lt;code&gt;helpers.js&lt;/code&gt;, &lt;code&gt;common.go&lt;/code&gt;, &lt;code&gt;lib/&lt;/code&gt;, &lt;code&gt;services/&lt;/code&gt;, &lt;code&gt;manager/&lt;/code&gt;. Count functions named &lt;code&gt;process&lt;/code&gt;, &lt;code&gt;handle&lt;/code&gt;, &lt;code&gt;run&lt;/code&gt;, &lt;code&gt;execute&lt;/code&gt;, &lt;code&gt;do&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Each one is a place where the model has to guess. Each one is a place where you have to write a longer prompt to compensate. Each one is a paragraph you will never have to write again if you rename it now.&lt;/p&gt;

&lt;p&gt;The reason context engineering feels hard is that you are trying to solve at retrieval time what should have been solved at naming time. You cannot grep your way out of a vague folder structure. You cannot embedding-search your way out of &lt;code&gt;process()&lt;/code&gt;. The model is asking the same question the senior engineer asks in week one. The answer was always going to be: name things by what they do, not by what they are.&lt;/p&gt;

&lt;h2&gt;
  
  
  The role that already existed
&lt;/h2&gt;

&lt;p&gt;There is a real version of context engineering. Chunking strategy, embedding choice, retrieval rerankers, evaluation harnesses. That work is real and hard.&lt;/p&gt;

&lt;p&gt;Most of what gets called context engineering this month is the rename you skipped in the original PR.&lt;/p&gt;

&lt;p&gt;A context engineer is a developer who finally names things.&lt;/p&gt;

&lt;p&gt;What is the file in your codebase that the agent keeps loading wrong?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Your AI Workflow Doesn't Need Better Prompts. It Needs Less AI.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 05 May 2026 06:33:55 +0000</pubDate>
      <link>https://dev.to/reneza/your-ai-workflow-doesnt-need-better-prompts-it-needs-less-ai-1cfp</link>
      <guid>https://dev.to/reneza/your-ai-workflow-doesnt-need-better-prompts-it-needs-less-ai-1cfp</guid>
      <description>&lt;p&gt;The first stage of AI work is prompting.&lt;/p&gt;

&lt;p&gt;The last stage is removing the model from most of the workflow.&lt;/p&gt;

&lt;p&gt;That sounds backwards.&lt;/p&gt;

&lt;p&gt;It is not.&lt;/p&gt;

&lt;p&gt;When a workflow is new, the LLM is useful because the work is still ambiguous. You are discovering what good looks like. You try a prompt, read the output, adjust the examples, change the tone, add constraints, and run it again.&lt;/p&gt;

&lt;p&gt;That is a good use of AI.&lt;/p&gt;

&lt;p&gt;But if the same workflow keeps coming back, and you are still explaining it to the model every time, you are not building capability. You are repeating yourself with a better interface.&lt;/p&gt;

&lt;p&gt;The mature workflow is not one where the LLM does everything.&lt;/p&gt;

&lt;p&gt;The mature workflow is one where the LLM only handles the part where ambiguity is useful.&lt;/p&gt;

&lt;p&gt;Everything else becomes process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompting Is Discovery
&lt;/h2&gt;

&lt;p&gt;Prompting is where most people start because it is the fastest way to get feedback.&lt;/p&gt;

&lt;p&gt;You ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Write this article."&lt;/li&gt;
&lt;li&gt;"Make it sound less generic."&lt;/li&gt;
&lt;li&gt;"Use my style."&lt;/li&gt;
&lt;li&gt;"Add examples."&lt;/li&gt;
&lt;li&gt;"Make the intro stronger."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage, the model is helping you figure out the shape of the task.&lt;/p&gt;

&lt;p&gt;You do not fully know the target yet. You are exploring. You are testing whether the idea even works. You are using the model as a thinking partner, a drafter, a critic, and sometimes a mirror.&lt;/p&gt;

&lt;p&gt;That is fine.&lt;/p&gt;

&lt;p&gt;The mistake is treating this as the final form.&lt;/p&gt;

&lt;p&gt;If you have to keep saying the same thing, you do not have a workflow. You have a recurring conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Prompts Still Have a Ceiling
&lt;/h2&gt;

&lt;p&gt;The next stage is usually better prompting.&lt;/p&gt;

&lt;p&gt;You add examples. You add constraints. You add a target audience. You tell the model what to avoid. You define the output format. You write a longer system prompt.&lt;/p&gt;

&lt;p&gt;The output improves.&lt;/p&gt;

&lt;p&gt;For a while, this feels like progress.&lt;/p&gt;

&lt;p&gt;But longer prompts have a hidden failure mode: the model still has to remember and balance everything at once.&lt;/p&gt;

&lt;p&gt;Style rules. Factual constraints. Tone. Audience. Platform conventions. Examples. Edge cases. Forbidden phrases. Review criteria.&lt;/p&gt;

&lt;p&gt;All of it goes into one big instruction block.&lt;/p&gt;

&lt;p&gt;The prompt becomes a pile of expectations, and the model is still the one deciding which expectations matter in the moment.&lt;/p&gt;

&lt;p&gt;That is fragile.&lt;/p&gt;

&lt;p&gt;At some point, "make the prompt better" stops being the right move.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills Are Repetition
&lt;/h2&gt;

&lt;p&gt;The next level is turning repeated prompting into a skill.&lt;/p&gt;

&lt;p&gt;A skill packages the context and process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what files or sources to read&lt;/li&gt;
&lt;li&gt;what examples matter&lt;/li&gt;
&lt;li&gt;what tone to use&lt;/li&gt;
&lt;li&gt;what scripts to run&lt;/li&gt;
&lt;li&gt;what output format is expected&lt;/li&gt;
&lt;li&gt;what review criteria should be applied&lt;/li&gt;
&lt;li&gt;what fallback path to use when something breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a real improvement.&lt;/p&gt;

&lt;p&gt;The workflow becomes portable. You stop explaining everything from scratch. The model gets the right context faster. You are no longer relying on whatever happens to be in the current chat thread.&lt;/p&gt;

&lt;p&gt;For many AI workflows, this is the first serious productivity jump.&lt;/p&gt;

&lt;p&gt;But skills have their own failure mode.&lt;/p&gt;

&lt;p&gt;They can become too large.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Skills Become Prompt Monsters
&lt;/h2&gt;

&lt;p&gt;A skill can start as a clean reusable process and slowly turn into another giant prompt.&lt;/p&gt;

&lt;p&gt;More instructions.&lt;/p&gt;

&lt;p&gt;More exceptions.&lt;/p&gt;

&lt;p&gt;More "always do this."&lt;/p&gt;

&lt;p&gt;More "never do that."&lt;/p&gt;

&lt;p&gt;More examples.&lt;/p&gt;

&lt;p&gt;More scripts.&lt;/p&gt;

&lt;p&gt;More personal preferences.&lt;/p&gt;

&lt;p&gt;At some point, the skill is not making the workflow reliable. It is just giving the model more things to interpret.&lt;/p&gt;

&lt;p&gt;The model still has to decide what matters.&lt;/p&gt;

&lt;p&gt;The model still has to judge whether the output is good enough.&lt;/p&gt;

&lt;p&gt;The model is still checking its own homework.&lt;/p&gt;

&lt;p&gt;That is the point where the workflow needs to move outside the LLM.&lt;/p&gt;

&lt;p&gt;Not all of it.&lt;/p&gt;

&lt;p&gt;The stable parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model Can Write the Code. It Does Not Get to Decide Whether the Code Passes.
&lt;/h2&gt;

&lt;p&gt;Developers already understand this when we talk about code.&lt;/p&gt;

&lt;p&gt;We do not ask a developer, human or AI:&lt;/p&gt;

&lt;p&gt;"Does this code look correct?"&lt;/p&gt;

&lt;p&gt;We run the formatter.&lt;/p&gt;

&lt;p&gt;We run the linter.&lt;/p&gt;

&lt;p&gt;We run the tests.&lt;/p&gt;

&lt;p&gt;We run the type checker.&lt;/p&gt;

&lt;p&gt;We run CI.&lt;/p&gt;

&lt;p&gt;We use pre-commit hooks.&lt;/p&gt;

&lt;p&gt;The model can generate code, but it does not get to decide whether the code passes.&lt;/p&gt;

&lt;p&gt;The gate decides.&lt;/p&gt;

&lt;p&gt;That is the important shift.&lt;/p&gt;

&lt;p&gt;If a rule can be checked deterministically, it should not live only inside a prompt.&lt;/p&gt;

&lt;p&gt;For code, the gates are obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;formatting&lt;/li&gt;
&lt;li&gt;linting&lt;/li&gt;
&lt;li&gt;type checks&lt;/li&gt;
&lt;li&gt;unit tests&lt;/li&gt;
&lt;li&gt;integration tests&lt;/li&gt;
&lt;li&gt;golden files&lt;/li&gt;
&lt;li&gt;JSON schema validation&lt;/li&gt;
&lt;li&gt;pre-commit hooks&lt;/li&gt;
&lt;li&gt;CI checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent can still do useful work. It can draft the patch, explain a tradeoff, write a test, debug a failure, or propose a smaller path.&lt;/p&gt;

&lt;p&gt;But the agent should not be the final authority on whether the work satisfies the standard.&lt;/p&gt;

&lt;p&gt;That authority should be outside the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  This Applies to Content Too
&lt;/h2&gt;

&lt;p&gt;Content feels less deterministic than code, so people keep more of the workflow inside the prompt.&lt;/p&gt;

&lt;p&gt;But the same principle applies.&lt;/p&gt;

&lt;p&gt;If you have found a content formula that works, do not just ask the model to remember it.&lt;/p&gt;

&lt;p&gt;Turn it into gates.&lt;/p&gt;

&lt;p&gt;For example, before publishing an article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the title match the actual promise of the article?&lt;/li&gt;
&lt;li&gt;Does the first section create tension quickly?&lt;/li&gt;
&lt;li&gt;Is the reader obvious?&lt;/li&gt;
&lt;li&gt;Is there one clear argument?&lt;/li&gt;
&lt;li&gt;Does every section move the argument forward?&lt;/li&gt;
&lt;li&gt;Are there generic AI phrases that should be removed?&lt;/li&gt;
&lt;li&gt;Does the article contain a real observation or only recycled advice?&lt;/li&gt;
&lt;li&gt;Is there a concrete next action for the reader?&lt;/li&gt;
&lt;li&gt;Does the tag strategy match the article, not just the broad topic?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These checks are not perfect.&lt;/p&gt;

&lt;p&gt;But they are better than "make it good."&lt;/p&gt;

&lt;p&gt;"Make it good" is a vibe.&lt;/p&gt;

&lt;p&gt;A gate is a standard.&lt;/p&gt;

&lt;p&gt;The more often a workflow matters, the more it deserves standards that do not depend on the model's mood.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 20% LLM Workflow
&lt;/h2&gt;

&lt;p&gt;The mature version of an AI workflow is not:&lt;/p&gt;

&lt;p&gt;"The LLM does everything."&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;p&gt;"The LLM does the part where ambiguity is useful. The system handles the rest."&lt;/p&gt;

&lt;p&gt;That might mean the model only does 20% of the workflow.&lt;/p&gt;

&lt;p&gt;And that is a good thing.&lt;/p&gt;

&lt;p&gt;For a content workflow, the LLM might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;propose angles&lt;/li&gt;
&lt;li&gt;compare possible hooks&lt;/li&gt;
&lt;li&gt;draft sections&lt;/li&gt;
&lt;li&gt;rewrite unclear paragraphs&lt;/li&gt;
&lt;li&gt;find contradictions&lt;/li&gt;
&lt;li&gt;suggest examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the system should handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loading the source notes&lt;/li&gt;
&lt;li&gt;selecting the target platform&lt;/li&gt;
&lt;li&gt;applying the tag strategy&lt;/li&gt;
&lt;li&gt;checking title/promise match&lt;/li&gt;
&lt;li&gt;scanning for banned phrases&lt;/li&gt;
&lt;li&gt;verifying links&lt;/li&gt;
&lt;li&gt;enforcing formatting&lt;/li&gt;
&lt;li&gt;creating the publishing task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a coding workflow, the LLM might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inspect the codebase&lt;/li&gt;
&lt;li&gt;implement the change&lt;/li&gt;
&lt;li&gt;write tests&lt;/li&gt;
&lt;li&gt;explain a failure&lt;/li&gt;
&lt;li&gt;reduce a patch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the system should handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;formatting&lt;/li&gt;
&lt;li&gt;linting&lt;/li&gt;
&lt;li&gt;type checking&lt;/li&gt;
&lt;li&gt;test execution&lt;/li&gt;
&lt;li&gt;schema validation&lt;/li&gt;
&lt;li&gt;contract checks&lt;/li&gt;
&lt;li&gt;CI gates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not using AI less because AI is weak.&lt;/p&gt;

&lt;p&gt;It is using AI less because the workflow is becoming stronger.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LLM Should Be the Escalation Path, Not the Reflex
&lt;/h2&gt;

&lt;p&gt;There is a simple rule I keep coming back to:&lt;/p&gt;

&lt;p&gt;If code can check it, do not ask the model to remember it.&lt;/p&gt;

&lt;p&gt;If a script can clean it, do not spend tokens reasoning about it.&lt;/p&gt;

&lt;p&gt;If a test can catch it, do not rely on a sentence in a prompt.&lt;/p&gt;

&lt;p&gt;The LLM should be used when ambiguity remains.&lt;/p&gt;

&lt;p&gt;It should not be the first line of defense for things that are already measurable.&lt;/p&gt;

&lt;p&gt;This is where many AI workflows waste effort. They ask the model to handle everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;classify the task&lt;/li&gt;
&lt;li&gt;remember the rules&lt;/li&gt;
&lt;li&gt;generate the output&lt;/li&gt;
&lt;li&gt;check the output&lt;/li&gt;
&lt;li&gt;decide whether the output is done&lt;/li&gt;
&lt;li&gt;explain why it is done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is too much responsibility in one probabilistic step.&lt;/p&gt;

&lt;p&gt;Split the work.&lt;/p&gt;

&lt;p&gt;Let the model handle the ambiguous part.&lt;/p&gt;

&lt;p&gt;Let deterministic systems handle the stable part.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Self-Check for Your Own AI Workflow
&lt;/h2&gt;

&lt;p&gt;If you want to know where your workflow is immature, ask these questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What do I keep explaining to the model again and again?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That probably belongs in a skill.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What does the model keep judging by itself?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That probably belongs in a gate.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What failure would be obvious to a script, linter, test, schema, or checklist?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That should not live only in the prompt.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If I removed the LLM tomorrow, which parts of the workflow would still be clear?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those parts are real process.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which parts only work because the model is being generous?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those parts are risk.&lt;/p&gt;

&lt;p&gt;This self-check is uncomfortable because it reveals how much "automation" is just trust in a model call.&lt;/p&gt;

&lt;p&gt;But that is the point.&lt;/p&gt;

&lt;p&gt;Capability is not how much work you can hand to the model.&lt;/p&gt;

&lt;p&gt;Capability is how much of the workflow still holds when the model is only doing the part it is actually good at.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Maturity Curve
&lt;/h2&gt;

&lt;p&gt;The pattern looks like this:&lt;/p&gt;

&lt;p&gt;Prompt -&amp;gt; Skill -&amp;gt; Gate -&amp;gt; System&lt;/p&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;p&gt;Ask -&amp;gt; Package -&amp;gt; Validate -&amp;gt; Automate&lt;/p&gt;

&lt;p&gt;When the task is new, prompt.&lt;/p&gt;

&lt;p&gt;When the task repeats, create a skill.&lt;/p&gt;

&lt;p&gt;When the skill succeeds, move the stable checks into gates.&lt;/p&gt;

&lt;p&gt;When the gates are stable, reduce the LLM's responsibility.&lt;/p&gt;

&lt;p&gt;This is the path from beginner prompting to a 20% LLM workflow.&lt;/p&gt;

&lt;p&gt;It does not make the model irrelevant.&lt;/p&gt;

&lt;p&gt;It puts the model in the right place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wrong Goal Is Better Prompting Forever
&lt;/h2&gt;

&lt;p&gt;There is a lot of advice about better prompts.&lt;/p&gt;

&lt;p&gt;Some of it is useful.&lt;/p&gt;

&lt;p&gt;But better prompting is not the destination.&lt;/p&gt;

&lt;p&gt;Better prompting helps you discover the workflow.&lt;/p&gt;

&lt;p&gt;Skills help you repeat the workflow.&lt;/p&gt;

&lt;p&gt;Gates help you trust the workflow.&lt;/p&gt;

&lt;p&gt;Systems help you scale the workflow.&lt;/p&gt;

&lt;p&gt;If you stop at prompting, every task stays a negotiation.&lt;/p&gt;

&lt;p&gt;If you stop at skills, every process still depends on the model interpreting the skill correctly.&lt;/p&gt;

&lt;p&gt;If you add gates, the model has something it must pass.&lt;/p&gt;

&lt;p&gt;That is the difference between a helpful assistant and a reliable workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Useful Question
&lt;/h2&gt;

&lt;p&gt;The useful question is no longer:&lt;/p&gt;

&lt;p&gt;"How do I write a better prompt?"&lt;/p&gt;

&lt;p&gt;The useful question is:&lt;/p&gt;

&lt;p&gt;"Which part of this should stop being a prompt?"&lt;/p&gt;

&lt;p&gt;If it is repeated context, make it a skill.&lt;/p&gt;

&lt;p&gt;If it is a stable rule, make it a checklist.&lt;/p&gt;

&lt;p&gt;If it is measurable, make it a test.&lt;/p&gt;

&lt;p&gt;If it is non-negotiable, make it a gate.&lt;/p&gt;

&lt;p&gt;Let the LLM handle ambiguity.&lt;/p&gt;

&lt;p&gt;Make the system handle standards.&lt;/p&gt;

&lt;p&gt;Use prompts to discover.&lt;/p&gt;

&lt;p&gt;Use skills to repeat.&lt;/p&gt;

&lt;p&gt;Use gates to scale.&lt;/p&gt;

&lt;p&gt;And when the workflow is mature, let the LLM do less.&lt;/p&gt;

&lt;p&gt;That is not a downgrade.&lt;/p&gt;

&lt;p&gt;That is capability becoming real.&lt;/p&gt;

&lt;p&gt;Which part of your AI workflow should stop being a prompt?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>testing</category>
    </item>
    <item>
      <title>Pure semantic search missed 4 of 5 of my agent queries. Hybrid + parallel fan-out fixed it.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Sat, 02 May 2026 14:13:04 +0000</pubDate>
      <link>https://dev.to/reneza/agentic-knowledge-base-karpathys-llm-wiki-with-adapters-593n</link>
      <guid>https://dev.to/reneza/agentic-knowledge-base-karpathys-llm-wiki-with-adapters-593n</guid>
      <description>&lt;p&gt;When Karpathy's &lt;a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f" rel="noopener noreferrer"&gt;LLM Wiki&lt;/a&gt; post landed, I already had semantic search over my TickTick — qdrant for the vector store, nomic-embed-text via ollama for embeddings, a daily cron to keep the index fresh, the works. The agent-side retrieval wasn't the missing piece.&lt;/p&gt;

&lt;p&gt;What was missing was the &lt;em&gt;structure&lt;/em&gt;. Karpathy's framing — designate a wiki, write notes for an LLM reader, lean on retrieval instead of taxonomy — surfaced the parts of my setup that didn't have shape yet: where durable knowledge lives versus ephemeral tasks, how agents pull structured data out of notes humans wrote, why my existing semantic search sometimes returned the right answer and sometimes returned nothing useful.&lt;/p&gt;

&lt;p&gt;I almost migrated to plain markdown anyway. Thousands of durable notes — production playbooks, API quirks, decisions I want to survive next month's task list — already live in TickTick. They sync to my phone. Capture friction is zero. Migrating breaks all of that.&lt;/p&gt;

&lt;p&gt;So I built the wiki structure on top of TickTick, and made the storage layer swappable. &lt;strong&gt;The retrieval, the wiki conventions, the agent-data note pattern, the bench harness — none of those are TickTick-specific.&lt;/strong&gt; They're a small framework. You point it at TickTick / Notion / Obsidian / Things / a folder of markdown / whatever you've already invested years of capture habit into.&lt;/p&gt;

&lt;p&gt;I'm calling it &lt;strong&gt;Agentic Knowledge Base&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The framework, in one diagram
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  ┌───────────────────────────────────┐
                  │  Agent (Claude / scripts / cron)  │
                  └─────────────────┬─────────────────┘
                                    │ akb find / get / url / links
                                    ▼
                  ┌───────────────────────────────────┐
                  │  AKB Core                         │
                  │  • parallel retrieval + RRF       │
                  │  • corpus cache (5 min TTL)       │
                  │  • bench harness                  │
                  │  • usage logging                  │
                  └─────────────────┬─────────────────┘
                                    │ adapter interface
                  ┌─────────────────┼──────────────────┐
                  ▼                 ▼                  ▼
       ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
       │ adapter-       │  │ adapter-       │  │ adapter-       │
       │ ticktick       │  │ obsidian       │  │ notion         │
       │  (reference)   │  │  (filesystem)  │  │  (your turn)   │
       └────────────────┘  └────────────────┘  └────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Core is storage-agnostic. The retrieval, the cache, the bench, the usage logger — none of them know what TickTick is. They call a small adapter interface (~6 methods).&lt;/p&gt;

&lt;p&gt;Karpathy's setup, in this framing, is the &lt;strong&gt;filesystem adapter&lt;/strong&gt; of a broader pattern. Mine is the &lt;strong&gt;TickTick adapter&lt;/strong&gt;. Yours might be the Notion or Obsidian one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The adapter interface
&lt;/h2&gt;

&lt;p&gt;Six methods. Two payload shapes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;KnowledgeAdapter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;listProjects&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Project&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;listTasksInProject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;getTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;createTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TaskInput&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;updateTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TaskPatch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;urlFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;  &lt;span class="c1"&gt;// deep-link string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tasks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;notes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Task&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="nx"&gt;dueDate&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;modifiedTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anything you can list, get, and link to — task systems, note apps, plain folders — can be an adapter.&lt;/p&gt;

&lt;p&gt;If your storage exposes a native search endpoint, your adapter can implement an optional &lt;code&gt;searchByQuery(query)&lt;/code&gt; and the core will use it as one branch of the parallel retrieval. If not, the core falls back to its own keyword scan against the corpus.&lt;/p&gt;

&lt;p&gt;That's the whole interface. Everything interesting is in the Core.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two patterns the Core implements (worth stealing)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Agent-data notes
&lt;/h3&gt;

&lt;p&gt;A regular note whose body has a fenced JSON (or YAML) block. Humans read the prose at the top. Agents extract the JSON via the adapter.&lt;/p&gt;

&lt;p&gt;The note's content looks like this — prose first, then a single fenced JSON block:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Type:&lt;/strong&gt; agent-data&lt;br&gt;
&lt;strong&gt;Consumed by:&lt;/strong&gt; EOD triage cron, capture-time relevance enrichment&lt;/p&gt;

&lt;p&gt;A "trunk" is an active project the user cares about. Edit this list when projects launch, finish, or shift focus.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trunks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"release-engineering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"desc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"shipping cadence, deployment rituals, on-call rotation"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"writing-projects"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"desc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"drafts and edits across personal and client channels"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read it from any cron or agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;akb get &lt;span class="s2"&gt;"Trunk Catalog"&lt;/span&gt; &lt;span class="nt"&gt;--extract&lt;/span&gt; json | jq &lt;span class="s1"&gt;'.trunks[].name'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The benefit: one note, mobile-editable in your existing app, consumed by agents as structured data. &lt;strong&gt;Single source of truth, no schema migration.&lt;/strong&gt; This pattern works for anything an agent needs programmatically and a human needs to edit on the move: prompt templates, character locks for video projects, recurring queries, cron config.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Parallel retrieval with provenance
&lt;/h3&gt;

&lt;p&gt;Three retrievers run in parallel against a shared cached corpus, results are RRF-fused, and the top-K come back tagged with which retrievers agreed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid&lt;/strong&gt; — dense cosine (qdrant + nomic-embed) + sparse keyword, internally RRF'd&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyword&lt;/strong&gt; — substring match on title + content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notes-find&lt;/strong&gt; — title-fuzzy on a designated wiki project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a query like &lt;code&gt;openrouter api key&lt;/code&gt;, all three retrievers return the same gold note. The fused result tags it &lt;code&gt;sources: [hybrid, keyword, notes_find]&lt;/code&gt; — three independent signals agreeing means high confidence. Lower-ranked results have only one source — look at them with skepticism.&lt;/p&gt;

&lt;p&gt;For a query like &lt;code&gt;ffmpeg commands&lt;/code&gt;, the keyword tool misses (the literal phrase isn't in any document). Pure semantic misses too (nomic-embed underweights short titles like &lt;code&gt;ffmpeg&lt;/code&gt;). Hybrid catches it. The fan-out gracefully handles the asymmetry — different queries lean on different retrievers, and the core doesn't pretend any single algorithm is universally best.&lt;/p&gt;

&lt;p&gt;A 5-min disk-backed corpus cache means warm queries are sub-100ms. The first call after a cold start fetches your full task/note list (one batch — adapters that support it use a single API call; adapters that don't fall back to per-project iteration). Within a working session, retrieval is essentially free.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bench
&lt;/h2&gt;

&lt;p&gt;I built a small harness in &lt;code&gt;bench/&lt;/code&gt;. Questions paired with gold answers (the task or note that actually contains the answer). Each retriever runs against the same questions, results scored by hit@1 / recall@5 / MRR.&lt;/p&gt;

&lt;p&gt;Five agent-issued queries (the rephrased version Opus 4.7 actually generates, not the natural-language form a human types):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;hit@1&lt;/th&gt;
&lt;th&gt;recall@5&lt;/th&gt;
&lt;th&gt;MRR&lt;/th&gt;
&lt;th&gt;warm latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;keyword (substring)&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;td&gt;&amp;lt;100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;semantic (dense only)&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hybrid (dense + sparse RRF)&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;0.70&lt;/td&gt;
&lt;td&gt;~500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;find&lt;/strong&gt; (parallel + cache)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.70&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~93ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headline number for agent retrieval is &lt;code&gt;recall@5 = 80%&lt;/code&gt; — the right doc lands in the top five 4 times out of 5. Agents read top-K, not just rank 1, so recall@5 is the metric that actually predicts whether the agent gets the context it needs. Top-1 (60%) is a stricter cut and a leading indicator for "did the first guess work" — useful but not the bar. The benchmark won't generalize from five questions — grow it as confidence in a particular adapter accumulates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I optimized for the model, not for me
&lt;/h2&gt;

&lt;p&gt;There's a subtle reframe that took an embarrassing number of iterations to land.&lt;/p&gt;

&lt;p&gt;When &lt;em&gt;I&lt;/em&gt; use search, I type a single word: &lt;code&gt;ffmpeg&lt;/code&gt;. The keyword tool returns the right note instantly.&lt;/p&gt;

&lt;p&gt;When &lt;em&gt;Claude&lt;/em&gt; uses search on my behalf — "where did I document my ffmpeg workflow?" — it issues something like &lt;code&gt;find "What ffmpeg commands do I have notes on?"&lt;/code&gt;. Different shape entirely. The model writes longer queries. It uses question phrasing. It includes scope words.&lt;/p&gt;

&lt;p&gt;Optimizing for human queries was the wrong objective. The user (me) wasn't using these tools — Claude was. Every retrieval test had to be written in the form Opus 4.7 actually generates, not how I'd type it. That changes which retriever wins.&lt;/p&gt;

&lt;p&gt;Tomorrow's model writes queries differently. The benchmark needs to track &lt;em&gt;the model in use&lt;/em&gt;, not a fixed assumption about query shape. The bench file is short and dated; re-tune when the model changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I deliberately didn't build (yet)
&lt;/h2&gt;

&lt;p&gt;Karpathy's wiki post mentions periodically updating notes when facts change — propagating new information across the knowledge base. Useful at scale; auto-rewriting notes is high-blast-radius and needs an approval ramp before it's trustworthy. I sketched it: a weekly cron that semantic-searches for affected notes, drafts updates, queues them for my approval, applies the approved ones. Deferred.&lt;/p&gt;

&lt;p&gt;Same call on a "lint the wiki" pass (Karpathy idea: agent reads every note weekly, flags missing summaries, dangling references, contradictions). Useful at scale; premature when the wiki itself is still under construction.&lt;/p&gt;

&lt;p&gt;Both will live in Core when they ship — adapter-agnostic by design.&lt;/p&gt;

&lt;h2&gt;
  
  
  A daily flow (my setup, your tools optional)
&lt;/h2&gt;

&lt;p&gt;This is what runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capture (mobile, manual).&lt;/strong&gt; I add a task or note in my storage app. No CLI involved. The friction has to be zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture-time relevance prompt (when in a Claude session).&lt;/strong&gt; &lt;code&gt;akb create "..." --relevance&lt;/code&gt; appends a small instruction block to the result. Active Claude reads it, picks a project trunk, calls &lt;code&gt;akb update&lt;/code&gt; to append a &lt;code&gt;why: &amp;lt;trunk&amp;gt; — &amp;lt;reason&amp;gt;&lt;/code&gt; line. Five seconds of LLM-side reasoning makes that task much more retrievable later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EOD triage (cron, daily morning).&lt;/strong&gt; Pulls yesterday's completed tasks, scores them 0–3 against the trunks (read live from a &lt;code&gt;Trunk Catalog&lt;/code&gt; agent-data note), sends a Telegram message with keepers grouped by trunk. I read it on my phone with breakfast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval (during work, all surfaces).&lt;/strong&gt; When Claude needs context — &lt;code&gt;akb find &amp;lt;query&amp;gt;&lt;/code&gt; returns top-K with provenance. Cached, parallel, sub-100ms warm.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Swap "TickTick app" for "Notion / Obsidian / Things" and the flow is identical. The adapter changes, the daily ritual doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.1&lt;/strong&gt; — Core + reference TickTick adapter + bench (where I am today)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.2&lt;/strong&gt; — Filesystem adapter (Karpathy-style local markdown). Probably one weekend's work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.3&lt;/strong&gt; — Notion adapter (community contribution most likely)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.4&lt;/strong&gt; — Lint pass + fact-propagation queue with approval gate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.5&lt;/strong&gt; — Adapter for Apple Notes / Things / iA Writer (Mac-native captures)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/renezander030/agentic-knowledge-base" rel="noopener noreferrer"&gt;github.com/renezander030/agentic-knowledge-base&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;One-page summary: &lt;a href="https://gist.github.com/renezander030/c7bd6d5c4088e24d3add043720284453" rel="noopener noreferrer"&gt;gist.github.com/renezander030/c7bd6d5c4088e24d3add043720284453&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Karpathy's wiki idea is right. The implementation that fits an existing system isn't a folder of markdown — it's the agent-side primitives that turn whatever you already have into something the model can reason over.&lt;/p&gt;

&lt;p&gt;If you write your own adapter, I want to see it.&lt;/p&gt;

&lt;p&gt;—&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Posted from &lt;a href="https://renezander.com/blog/agentic-knowledge-base/" rel="noopener noreferrer"&gt;https://renezander.com/blog/agentic-knowledge-base/&lt;/a&gt;. Source at &lt;a href="https://github.com/renezander030/agentic-knowledge-base" rel="noopener noreferrer"&gt;https://github.com/renezander030/agentic-knowledge-base&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>rag</category>
      <category>agents</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
