<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fernando Rodriguez</title>
    <description>The latest articles on DEV Community by Fernando Rodriguez (@frr149).</description>
    <link>https://dev.to/frr149</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906301%2F4e64d5c4-6405-4465-9411-41b6e57e3818.jpg</url>
      <title>DEV Community: Fernando Rodriguez</title>
      <link>https://dev.to/frr149</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/frr149"/>
    <language>en</language>
    <item>
      <title>From /simplify to the Jedi Council: How I Built a Code Review with Kent Beck, Martin Fowler, and Mike Acton</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:21:52 +0000</pubDate>
      <link>https://dev.to/frr149/from-simplify-to-the-jedi-council-how-i-built-a-code-review-with-kent-beck-martin-fowler-and-16d9</link>
      <guid>https://dev.to/frr149/from-simplify-to-the-jedi-council-how-i-built-a-code-review-with-kent-beck-martin-fowler-and-16d9</guid>
      <description>&lt;p&gt;Claude Code includes a &lt;em&gt;slash command&lt;/em&gt; called &lt;code&gt;/simplify&lt;/code&gt; that automatically reviews your code. I ran it on a hefty diff — about 500 lines across 8 files — and the results were... interesting. It found things I wouldn’t have noticed, but it also wasted my time pointing out stuff that didn’t matter.&lt;/p&gt;

&lt;p&gt;So, I took it apart and rebuilt it piece by piece.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does /simplify Do?
&lt;/h2&gt;

&lt;p&gt;It’s a &lt;em&gt;skill&lt;/em&gt; that comes bundled with Claude Code (you don’t install it). It launches three agents in parallel, each looking at the same diff from a different angle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Code Reuse&lt;/strong&gt; — Are there existing utilities that could replace newly added code?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Quality&lt;/strong&gt; — Redundant state, &lt;em&gt;copy-paste&lt;/em&gt;, &lt;em&gt;leaky abstractions&lt;/em&gt;, &lt;em&gt;stringly-typed code&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency&lt;/strong&gt; — Unnecessary I/O, missed concurrency opportunities, &lt;em&gt;memory leaks&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The three produce findings, and then the system tries to fix the issues directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does Well
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;reuse&lt;/em&gt; agent caught a helper that was duplicated verbatim in two test suites. Same name, same lines, two different files. I moved it to a shared module. Nice and clean.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;efficiency&lt;/em&gt; agent spotted a double trip to disk inside a processing loop: load state, modify, save, read data, re-load, re-save. Two writes when one would suffice. I wouldn’t have noticed that myself.&lt;/p&gt;

&lt;p&gt;It also flagged a memory buffer that wasn’t cleaned up in the error path. If something failed between allocation and release, &lt;em&gt;leak&lt;/em&gt;. The main path was fine. Classic &lt;em&gt;copy-paste&lt;/em&gt; swallowing the detail.&lt;/p&gt;

&lt;p&gt;So far, so good. Three legitimate, actionable findings. But the problem with &lt;code&gt;/simplify&lt;/code&gt; isn’t what it catches — it’s everything else it reports.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Falls Short
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Too much noise in low-severity issues.&lt;/strong&gt; It suggested removing a field from a struct because it was “redundant” with a computed property. We’re talking 8 bytes. That field is used in more than 10 places in the code and the tests. The churn of removing it far outweighs the benefit of saving a single integer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No understanding of project context.&lt;/strong&gt; It flagged a concurrency pattern as HIGH risk, which is fair — that’s correct in the abstract. But it had already been documented in the project’s &lt;code&gt;CLAUDE.md&lt;/code&gt;, had a dedicated linter, was &lt;em&gt;allowlisted&lt;/em&gt;, and had an open issue. The agent didn’t know any of this because it works only with the diff, in complete isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Doesn’t distinguish "incorrect" from "improvable."&lt;/strong&gt; The double disk trip was inefficient but correct. The concurrency pattern was a latent bomb. Both came back as MEDIUM priority. The prioritization is flat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suggests enums for external data.&lt;/strong&gt; It claimed that some fields in a DTO should be enums instead of strings. But those fields come from an external API. They’re only read and displayed. Turning them into enums requires &lt;em&gt;custom decoding&lt;/em&gt; and adds nothing — if the API sends a new value, your enum blows up instead of gracefully degrading.&lt;/p&gt;

&lt;p&gt;These are mistakes a developer with project context would filter out in two seconds. But &lt;code&gt;/simplify&lt;/code&gt; has no context. It has a diff and good intentions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Fixes I Made
&lt;/h2&gt;

&lt;p&gt;After reviewing the outputs, I identified three structural problems with &lt;code&gt;/simplify&lt;/code&gt; and fixed them in a custom &lt;em&gt;skill&lt;/em&gt; I called &lt;code&gt;/improve&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Inject Project Context
&lt;/h3&gt;

&lt;p&gt;Each agent receives the &lt;code&gt;CLAUDE.md&lt;/code&gt;, open issues from the tracker, and linter results before generating findings. If something is already managed, it mentions it but doesn’t report it as new.&lt;/p&gt;

&lt;p&gt;This eliminates the most irritating category of &lt;em&gt;false positives&lt;/em&gt;: the ones you already know about and have under control.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cost/Benefit Filtering
&lt;/h3&gt;

&lt;p&gt;Before reporting, each agent estimates how many files the fix would touch. If the effort-to-improvement ratio is unfavorable — like renaming a field in 10+ spots for minor readability gains — it filters it out.&lt;/p&gt;

&lt;p&gt;This seems obvious, but &lt;code&gt;/simplify&lt;/code&gt; doesn’t do it. It treats a one-line change and a 15-file refactor with the same priority.&lt;/p&gt;
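
&lt;p&gt;To make it concrete, here is a minimal sketch of that kind of gate. The names and the threshold are mine, not &lt;code&gt;/simplify&lt;/code&gt;’s internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical cost/benefit gate; every name and threshold is illustrative.
def worth_reporting(finding):
    files_touched = len(finding["files_to_change"])
    lines_cleaned = finding["lines_removed_or_simplified"]
    # A fix that touches many files to clean up a handful of lines
    # is churn, not improvement. Drop it.
    return lines_cleaned &gt; files_touched * 10

findings = [
    {"title": "remove 8-byte field", "files_to_change": ["a"] * 12,
     "lines_removed_or_simplified": 3},
    {"title": "dedupe test helper", "files_to_change": ["x", "y"],
     "lines_removed_or_simplified": 40},
]
reported = [f for f in findings if worth_reporting(f)]  # only the dedupe survives
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;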

&lt;h3&gt;
  
  
  3. Separate "Auto-Fix" from "Backlog Issue"
&lt;/h3&gt;

&lt;p&gt;Findings are split into two types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;auto-fix&lt;/code&gt;&lt;/strong&gt;: Mechanical, ≤3 files, low risk. Applied directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;issue&lt;/code&gt;&lt;/strong&gt;: Requires design, touches &amp;gt;3 files, or changes an interface. Created as a tracker issue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents the review from attempting fixes that need more thought.&lt;/p&gt;
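
&lt;p&gt;A sketch of the triage, using the thresholds from the list above (field names are mine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative triage; the thresholds match the rules above.
def triage(finding):
    if (finding["mechanical"]
            and finding["files_touched"] &lt;= 3
            and not finding["changes_interface"]):
        return "auto-fix"   # applied directly
    return "issue"          # sent to the tracker for a human

print(triage({"mechanical": True, "files_touched": 2, "changes_interface": False}))  # auto-fix
print(triage({"mechanical": True, "files_touched": 7, "changes_interface": False}))  # issue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;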

&lt;h2&gt;
  
  
  What I Didn’t Do (And Why)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A second LLM as a reviewer.&lt;/strong&gt; Sexy idea — &lt;em&gt;cross-model validation&lt;/em&gt;, more eyes, additional &lt;em&gt;training&lt;/em&gt;. In practice, the bottleneck isn’t the number of eyes but the quality of the context. A second model without access to the &lt;code&gt;CLAUDE.md&lt;/code&gt; or tracker spits out the same thing: generic “best practices” advice you can find in any book.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Categorization into 4 severity levels.&lt;/strong&gt; I started with CRITICAL/HIGH/MEDIUM/LOW, but with cost/benefit filtering active, almost everything that passes the filter is MEDIUM or HIGH. The other two categories are empty. More taxonomy doesn’t mean better prioritization.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Jedi Council
&lt;/h2&gt;

&lt;p&gt;And here’s the idea that changed the game.&lt;/p&gt;

&lt;p&gt;A few weeks ago, I wrote about &lt;a href="https://dev.to/posts/summoning-the-wise-mentoring-experts-llm/"&gt;invoking experts as mentors&lt;/a&gt; — asking an LLM to adopt the perspective of Tufte, Munger, or whoever fits your needs. It worked brilliantly in design.&lt;/p&gt;

&lt;p&gt;What if, instead of three generic agents (&lt;em&gt;reuse&lt;/em&gt;, &lt;em&gt;quality&lt;/em&gt;, &lt;em&gt;efficiency&lt;/em&gt;), I used three agents with &lt;strong&gt;names, philosophies, and specific decision rules&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;The idea has a name every Star Wars fan will recognize: the &lt;strong&gt;Jedi Council&lt;/strong&gt;. Three masters with different perspectives evaluating the same case. But be careful — this isn’t about the LLM doing &lt;em&gt;surface-level impersonations&lt;/em&gt; by quoting famous lines. It’s about each “wise master” applying &lt;strong&gt;specific filtering rules&lt;/strong&gt; that a generic reviewer wouldn’t.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Three Masters (and Why These)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kent Beck — Simplicity.&lt;/strong&gt; &lt;em&gt;"Make it work, make it right, make it fast — in that order."&lt;/em&gt; He’s the guy who tells you “those two identical blocks of code are fine, don’t extract a helper just yet.” His key rule: &lt;strong&gt;The Rule of Three&lt;/strong&gt;. DO NOT report duplication unless the same block appears three times. Twice is coincidence. Three times is a pattern. And if the fix touches more files than the code it’d clean up, it’s probably not worth it.&lt;/p&gt;

&lt;p&gt;But Beck isn’t just about simplicity. He also catches &lt;strong&gt;correctness bugs&lt;/strong&gt;: cases where the obvious choice has semantics different from the correct one. That &lt;code&gt;async&lt;/code&gt; keyword that seems harmless but inherits a context you don’t want. The &lt;em&gt;default&lt;/em&gt; that works in tests but blows up in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Martin Fowler — Design.&lt;/strong&gt; &lt;em&gt;Code smells&lt;/em&gt; are symptoms, not diseases. Refactoring is a discipline, not a hobby. His key rule: &lt;strong&gt;Only suggest refactoring if there’s a concrete change it would benefit.&lt;/strong&gt; &lt;em&gt;"Refactoring without direction is codebase tourism."&lt;/em&gt; If a string comes from an external API and is only read, don’t suggest converting it to an enum. If one field is always synchronized with another by design, the redundancy is intentional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mike Acton — Performance.&lt;/strong&gt; &lt;em&gt;"The purpose of all programs is to transform data from one form to another."&lt;/em&gt; If you haven’t measured, you don’t have a performance problem — you have an opinion. His key rule: &lt;strong&gt;I/O is what matters in 99% of apps.&lt;/strong&gt; CPU rarely bottlenecks. Disk and network do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Acton Doesn’t Guess — He Measures
&lt;/h3&gt;

&lt;p&gt;Here’s where it gets interesting. Mike Acton doesn’t stop at static analysis. He does two things before rendering a verdict:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static I/O Counting&lt;/strong&gt;: Scans the diff for read/write operations to disk, network, or databases. Maps each operation to its context: Is it in a loop? A &lt;em&gt;hot path&lt;/em&gt;? Generates a frequency table before offering an opinion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Profiling&lt;/strong&gt;: If the diff touches &lt;em&gt;hot path&lt;/em&gt; code and the project can compile, runs a &lt;em&gt;profiler&lt;/em&gt; and condenses the results. If a &lt;em&gt;hotspot&lt;/em&gt; aligns with code from the diff, it reports it with numbers, not opinions.&lt;/p&gt;

&lt;p&gt;The I/O table includes rough time estimates: SSD read ~0.5ms, write ~1ms, flush ~2-5ms, network ~100-500ms. It’s not precise — it’s for spotting operations that, in aggregate, cross a threshold.&lt;/p&gt;
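
&lt;p&gt;For illustration, the shape of that table in code. The per-operation costs are the rough estimates above; the findings and the loop size are invented:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough I/O cost model in ms; estimates from the text, structure is mine.
IO_COST_MS = {"ssd_read": 0.5, "ssd_write": 1.0, "flush": 3.0, "network": 300.0}

# Hypothetical findings from scanning a diff: (operation, count, in_loop)
ops = [("ssd_write", 2, True), ("network", 1, False)]

LOOP_ITERATIONS = 100  # assumed hot-loop size, for aggregation
for op, count, in_loop in ops:
    total_ms = IO_COST_MS[op] * count * (LOOP_ITERATIONS if in_loop else 1)
    print(f"{op}: ~{total_ms:.0f} ms aggregate")  # flag anything over a threshold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;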

&lt;h2&gt;
  
  
  The Risk (And How to Avoid It)
&lt;/h2&gt;

&lt;p&gt;Before you start building a Jedi Council for every pull request, here’s the elephant in the room: &lt;strong&gt;The LLM can do surface-level impersonation.&lt;/strong&gt; It might output “as Kent Beck would say…” and just spout the same generic advice under his name.&lt;/p&gt;

&lt;p&gt;To avoid this, the instructions don’t say “adopt Kent Beck’s perspective.” They say: &lt;em&gt;"Apply the Rule of Three: if a fix touches more files than it cleans up, discard it."&lt;/em&gt; Specific rules, not vibes.&lt;/p&gt;

&lt;p&gt;Also, each master &lt;strong&gt;must finish with a “Discarded” section&lt;/strong&gt; — findings they considered but rejected, with the rule applied. This makes it clear that the master actively filtered, not just reported less.&lt;/p&gt;

&lt;p&gt;And if two masters disagree on the same code — Beck says “don’t touch” while Fowler says “refactor” — a moderator agent evaluates the specific case until consensus is reached. If no consensus → discard. Better to do nothing than do the wrong thing.&lt;/p&gt;

&lt;p&gt;Is it the same as having Kent Beck in the room? Obviously not. But it’s infinitely better than three generic agents reporting everything without judgment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Test: Same Diff, Two Reviews
&lt;/h2&gt;

&lt;p&gt;I ran the same diff through &lt;code&gt;/simplify&lt;/code&gt; and &lt;code&gt;/improve&lt;/code&gt;. Same changes, same project, same session:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;/simplify&lt;/th&gt;
&lt;th&gt;/improve&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reported Findings&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;False Positives&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;3-4&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New Findings&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1 (concurrency bug, HIGH)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Discarded" Section&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes, with applied rule&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project Context&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;CLAUDE.md + tracker + linters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The new finding &lt;code&gt;/improve&lt;/code&gt; caught that &lt;code&gt;/simplify&lt;/code&gt; didn’t: a concurrency bug where an apparently correct pattern inherited a faulty execution context, causing the UI to freeze. In plain language: the code looked fine, compiled cleanly, but it blocked the main thread. &lt;code&gt;/simplify&lt;/code&gt; missed it because its generic agents don’t look for bugs where the “obvious” choice is wrong. Kent Beck did, because that’s exactly his mandate.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;false positives&lt;/em&gt; from &lt;code&gt;/simplify&lt;/code&gt; — the “redundant” 8-byte field, the already managed concurrency pattern, the enums for external JSON — didn’t show up in &lt;code&gt;/improve&lt;/code&gt;. Cost/benefit filtering caught the first. Project context filtered out the second. Fowler’s rule (“&lt;em&gt;stringly-typed&lt;/em&gt; is fine unless it hurts”) discarded the third.&lt;/p&gt;

&lt;p&gt;What sold me the most: the “Discarded” section. Seeing what each master considered and why they rejected it inspires far more trust than just seeing what they reported. &lt;strong&gt;You know they looked at more than they said.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Install It
&lt;/h2&gt;

&lt;p&gt;The skill is called &lt;code&gt;/improve&lt;/code&gt; and lives in &lt;code&gt;~/.claude/skills/improve/SKILL.md&lt;/code&gt;. It’s a global skill for Claude Code — works in any project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create the directory&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/.claude/skills/improve

&lt;span class="c"&gt;# Copy the SKILL.md (or write your own following Claude Code’s skill structure)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Review only (no code changes)&lt;/span&gt;
/improve

&lt;span class="c"&gt;# Review + apply mechanical fixes&lt;/span&gt;
/improve &lt;span class="nt"&gt;--fix&lt;/span&gt;

&lt;span class="c"&gt;# Review + draft report for senior dev&lt;/span&gt;
/improve &lt;span class="nt"&gt;--report&lt;/span&gt;

&lt;span class="c"&gt;# Review a specific commit range&lt;/span&gt;
/improve &lt;span class="nt"&gt;--diff&lt;/span&gt; HEAD~5..HEAD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/simplify&lt;/code&gt; is a solid starting point. Three generic agents detect duplication, inefficiencies, and &lt;em&gt;code smells&lt;/em&gt;. But without project context, it creates noise; without cost/benefit filtering, it suggests changes that aren’t worth it; and without specific criteria, it treats all findings equally.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/improve&lt;/code&gt; is the next step: three masters with targeted philosophies, project context, cost/benefit analysis, and separation between auto-fixes and backlog issues. Beck tells you when NOT to extract a helper. Fowler tells you when a &lt;em&gt;smell&lt;/em&gt; is purely cosmetic. Acton tells you when a “performance problem” is just an unmeasured opinion.&lt;/p&gt;

&lt;p&gt;Fewer findings, zero &lt;em&gt;false positives&lt;/em&gt;, and one real bug the other missed. Sometimes improvement isn’t more eyes — it’s better eyes.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://dev.to/posts/summoning-the-wise-mentoring-experts-llm/"&gt;Invoking the Experts&lt;/a&gt; — the original technique with Tufte and Munger. &lt;a href="https://dev.to/posts/claude-code-loop-vs-cron-scheduling/"&gt;/loop vs claude-cron&lt;/a&gt; — another Claude Code skill I analyzed.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>codereview</category>
      <category>productivity</category>
      <category>ai</category>
    </item>
    <item>
      <title>RustyClaw: I'm rewriting an AI agent in Rust (because the meme demands it)</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:19:50 +0000</pubDate>
      <link>https://dev.to/frr149/rustyclaw-im-rewriting-an-ai-agent-in-rust-because-the-meme-demands-it-280i</link>
      <guid>https://dev.to/frr149/rustyclaw-im-rewriting-an-ai-agent-in-rust-because-the-meme-demands-it-280i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You know what’s great about Rust? It doesn’t let you compile crappy code. You know what sucks? Everything you write at the beginning **is&lt;/em&gt;* crappy code."*&lt;br&gt;
— Mr. Krabs, probably&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What’s better than an AI agent? An AI agent &lt;em&gt;rewritten in Rust&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you’ve spent more than five minutes on the internet, you’re aware of the meme. It doesn’t matter what project—text editor, DNS server, BMI calculator. Someone will inevitably comment, "you should rewrite it in Rust." It’s the &lt;em&gt;Rewrite It In Rust&lt;/em&gt; meme—RIIR for friends—and it’s as unavoidable as gravity.&lt;/p&gt;

&lt;p&gt;Well, I’m actually doing it. I’m going to port 8,300 lines of a Python AI agent to Rust. But not just because the meme demands it (okay, maybe a little). I’m doing it because I need a guinea pig.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thesis
&lt;/h2&gt;

&lt;p&gt;For weeks now, I’ve been writing about &lt;a href="https://dev.to/posts/silent-failure-ai-makes-stuff-up-tests-everything-fine/"&gt;&lt;em&gt;silent failures&lt;/em&gt;&lt;/a&gt;, about the &lt;a href="https://dev.to/posts/five-defenses-code-hallucinations/"&gt;five defenses against hallucinations&lt;/a&gt;, about how an LLM can generate code that compiles, passes tests, and is still wrong. I even gave it a name: &lt;strong&gt;adversarial development&lt;/strong&gt;. &lt;em&gt;Never trust, always verify.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A lot of theory. Now it’s time to prove it.&lt;/p&gt;

&lt;p&gt;I needed a project with three key traits: constrained scope (not a new app with ever-changing requirements), a clear source of truth (the Python code that already works), and enough complexity for the LLM’s hallucinations to have room to hide. A pure port checks all three boxes: the input and expected output already exist. If the Rust version doesn’t behave exactly like the Python one, there’s a bug. Simple as that.&lt;/p&gt;

&lt;p&gt;And since I’m going to port something, why not use it as an opportunity to properly learn Rust? The &lt;em&gt;borrow checker&lt;/em&gt;, &lt;em&gt;ownership&lt;/em&gt;, &lt;em&gt;lifetimes&lt;/em&gt;... I’ve spent years reading all about it and touching none of it. Things would be different if I stopped reading tutorials for the 20th time and actually tackled a real project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The patient
&lt;/h2&gt;

&lt;p&gt;It’s called &lt;a href="https://github.com/HKUDS/nanobot" rel="noopener noreferrer"&gt;nanobot&lt;/a&gt;. It’s a personal AI agent derived from OpenClaw: a nifty tool that links LLMs (Claude, GPT, DeepSeek, you name it) to chat channels—Telegram, Discord, Slack, email—and gives them hands. It can read/edit files, run commands, browse the web, schedule cron tasks, and maintain persistent memories between conversations.&lt;/p&gt;

&lt;p&gt;It works. It’s been running fine. In Python.&lt;/p&gt;

&lt;p&gt;What’s the problem? It’s &lt;em&gt;single-threaded&lt;/em&gt;. One message at a time. Send it three messages back-to-back, and they queue up like a Saturday morning line at Walmart. It uses about 50MB of RAM to essentially shuffle JSON between APIs. And its error handling is the type you’re embarrassed about: &lt;code&gt;return f"Error: {str(e)}"&lt;/code&gt; scattered all over.&lt;/p&gt;

&lt;p&gt;To put it bluntly: it works, but it’s a giant hack. Perfect candidate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Rust (besides the meme)?
&lt;/h2&gt;

&lt;p&gt;I could fix it in Python. I could dial up the &lt;code&gt;asyncio&lt;/code&gt;, tighten up error-handling with custom exceptions, and optimize memory. The sane option.&lt;/p&gt;

&lt;p&gt;But sane doesn’t give me a &lt;em&gt;test bench&lt;/em&gt; for adversarial development. Refactoring in Python lacks an external source of truth—the "before" and "after" would share language, libraries, and the LLM’s biases. A port to a different language? That’s different. If Rust’s output differs from Python’s for the same input, somebody’s lying. And that’s exactly the kind of verification I want to test.&lt;/p&gt;

&lt;p&gt;Plus, Rust comes with properties that make the experiment more interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The compiler as a first line of defense.&lt;/strong&gt; Nulls, type mismatches, data races—entire categories of bugs that might silently creep into Python won’t even compile in Rust. How many LLM hallucinations can the compiler block before they hit a test? I want to measure that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True concurrency.&lt;/strong&gt; &lt;code&gt;tokio&lt;/code&gt; allows one &lt;code&gt;spawn&lt;/code&gt; per conversation. In Python, that’s a pain. This is the one functional improvement that really justifies the port.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static binaries.&lt;/strong&gt; A 10MB executable instead of a &lt;code&gt;pip install&lt;/code&gt; with 47 dependencies. That’s a win for distribution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It’s cool.&lt;/strong&gt; Not technically a reason, but I don’t care.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The adventure (and the invite)
&lt;/h2&gt;

&lt;p&gt;RustyClaw—that’s the port’s name—is going to be a publicly documented experiment. Each module I port will be its own blog post. With real data: how many tokens used, cost, how often the AI hallucinated, and how long I fought with the &lt;em&gt;borrow checker&lt;/em&gt;. No sugarcoating.&lt;/p&gt;

&lt;p&gt;If I spend 3 hours on something I could have done in Python in 10 minutes, I’ll admit it. If the LLM invents a non-existent &lt;em&gt;crate&lt;/em&gt; (spoiler: it will), I’ll detail it. If I realize at the end this port wasn’t worth it, I’ll confess to that too.&lt;/p&gt;

&lt;p&gt;Everyone says, "I used AI to write code." No one publishes how much it cost, how often it lied to them, or if the code held up in production. That’s exactly what I’m going to do.&lt;/p&gt;

&lt;p&gt;And I want you to come along for the ride. Because this is going to be an adventure—filled with compiler battles, "WHY WON’T THIS COMPILE IT’S OBVIOUS" moments, and small victories when a differential test passes green. It’s going to be fun. Or, at the very least, honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack (cheat sheet)
&lt;/h2&gt;

&lt;p&gt;If you’re a Pythonista, the left column will look familiar. If you’re a Rustacean, the right. If you’re neither, welcome to the chaos.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Python (nanobot)&lt;/th&gt;
&lt;th&gt;Rust (rustyclaw)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Async runtime&lt;/td&gt;
&lt;td&gt;&lt;code&gt;asyncio&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;tokio&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP&lt;/td&gt;
&lt;td&gt;&lt;code&gt;httpx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;reqwest&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM routing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;litellm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Nonexistent&lt;/strong&gt; — custom router&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Telegram&lt;/td&gt;
&lt;td&gt;&lt;code&gt;python-telegram-bot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;teloxide&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Discord&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;websockets&lt;/code&gt; (raw)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;tokio-tungstenite&lt;/code&gt; (raw)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pydantic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;serde&lt;/code&gt; + &lt;code&gt;figment&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;&lt;code&gt;typer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;clap&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Errors&lt;/td&gt;
&lt;td&gt;&lt;code&gt;str(e)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;anyhow&lt;/code&gt; + &lt;code&gt;thiserror&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;&lt;code&gt;loguru&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;tracing&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI copilot&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Claude Code + Codex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task runner&lt;/td&gt;
&lt;td&gt;&lt;code&gt;make&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;just&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Issue tracker&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;linear&lt;/code&gt; CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The row that hurts most is LiteLLM. In Python, it routes 100+ LLM providers in a single call. Nothing comes close in Rust. I’ll need to roll my own router. The upside? About 80% of LLM providers conform to OpenAI’s API, so between &lt;code&gt;async-openai&lt;/code&gt; + a custom base URL, most use-cases are covered. Anthropic will need its own implementation.&lt;/p&gt;
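
&lt;p&gt;The routing idea, sketched in Python for brevity (the port itself will be Rust, and every URL and name below is an assumption, not the final design):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: most providers speak the OpenAI wire format, so routing is
# mostly "pick a base URL". Anthropic gets its own code path.
OPENAI_COMPATIBLE = {
    "openai":   "https://api.openai.com/v1",
    "deepseek": "https://api.deepseek.com/v1",
}

def route(provider):
    if provider in OPENAI_COMPATIBLE:
        return ("openai-format", OPENAI_COMPATIBLE[provider])
    if provider == "anthropic":
        return ("anthropic-format", "https://api.anthropic.com")
    raise ValueError(f"unknown provider: {provider}")

print(route("deepseek"))  # ('openai-format', 'https://api.deepseek.com/v1')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;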

&lt;p&gt;Around 300 lines of Rust. Sounds manageable. &lt;em&gt;Sounds.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-hallucination strategy (the serious bit)
&lt;/h2&gt;

&lt;p&gt;This is where the adversarial development theory meets reality. An LLM assisting in a port this size is a machine for plausibly inventing things. &lt;/p&gt;

&lt;p&gt;The top risk isn’t that the code won’t compile—Rust doesn’t let garbage compile. The risk is that it compiles, passes tests, and silently does the wrong thing. Exactly the &lt;em&gt;silent failure&lt;/em&gt; I wrote about two weeks ago.&lt;/p&gt;

&lt;p&gt;Five layers of defense:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Rust’s compiler.&lt;/strong&gt; Eliminates nulls, type mismatches, and data races. First free line of defense. But just because it compiles doesn’t make it right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Differential tests.&lt;/strong&gt; Same input → Python nanobot → output. Same input → RustyClaw → output. If they don’t match, something’s off. The Python code is the source of truth. This is the backbone of the experiment.&lt;/p&gt;
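
&lt;p&gt;The harness can be embarrassingly simple. A sketch, assuming both versions expose a CLI that reads stdin and writes stdout (file names are mine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import subprocess

def differential(case):
    # Source of truth: the working Python agent.
    expected = subprocess.run(["python", "nanobot.py"], input=case,
                              capture_output=True, text=True).stdout
    # Candidate: the Rust port.
    actual = subprocess.run(["./rustyclaw"], input=case,
                            capture_output=True, text=True).stdout
    return expected == actual

for case in ["ping", "schedule a cron task", "read file notes.txt"]:
    assert differential(case), f"divergence on: {case!r}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;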

&lt;p&gt;&lt;strong&gt;3. Provenance tracking.&lt;/strong&gt; Each ported file gets a header with its original Python source, LLM session, and test differential results. Total traceability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Crate verification.&lt;/strong&gt; Every crate suggested by the LLM → manually verify on crates.io and docs.rs. LLMs will confidently propose non-existent crates and APIs that just don’t work.&lt;/p&gt;
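
&lt;p&gt;Even that check can be semi-automated. A sketch against crates.io’s public API (the endpoint is real; the workflow is mine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import urllib.error
import urllib.request

def crate_exists(name):
    url = f"https://crates.io/api/v1/crates/{name}"
    req = urllib.request.Request(url, headers={"User-Agent": "rustyclaw-check"})
    try:
        urllib.request.urlopen(req)
        return True
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False  # the crate does not exist; the LLM invented it
        raise

print(crate_exists("tokio"))  # True: real crate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;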

&lt;p&gt;&lt;strong&gt;5. Incident logging.&lt;/strong&gt; Every detected hallucination → an issue logged with a &lt;code&gt;hallucination&lt;/code&gt; label. Material for posts and lessons learned.&lt;/p&gt;

&lt;p&gt;The golden rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The verification system must be external to the generator.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the LLM writes the code, the tests, and the fixtures, you’re validating fiction with fiction. Differential testing against the original Python code naturally breaks the cycle and makes the port inherently verifiable.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;Does it matter?&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;So, the uncomfortable question—does porting this to Rust even matter?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Rust (estimated)&lt;/th&gt;
&lt;th&gt;Does it matter?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Response latency&lt;/td&gt;
&lt;td&gt;~200ms overhead&lt;/td&gt;
&lt;td&gt;~5ms overhead&lt;/td&gt;
&lt;td&gt;No. The LLM takes 2-5 seconds anyway.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;~50MB&lt;/td&gt;
&lt;td&gt;~5MB&lt;/td&gt;
&lt;td&gt;No. My server has 8GB.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency&lt;/td&gt;
&lt;td&gt;1 message at a time&lt;/td&gt;
&lt;td&gt;N messages in parallel&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Startup time&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;td&gt;~50ms&lt;/td&gt;
&lt;td&gt;Meh.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip install&lt;/code&gt; + 47 deps&lt;/td&gt;
&lt;td&gt;Single executable&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type safety&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;str(e)&lt;/code&gt; everywhere&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Result&amp;lt;T, E&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The cool factor&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Subjective.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three out of seven. Four, if we’re being generous. The latency and RAM improvements are meaningless since the bottleneck is always the LLM call. Concurrency matters for multiple users. A static binary is a real upgrade. And the type safety? After seeing how many bugs &lt;code&gt;str(e)&lt;/code&gt; lets fly under the radar for months, yeah, that matters.&lt;/p&gt;

&lt;p&gt;Does it justify weeks of work? As a standalone port, probably not. As a testbed for adversarial development with published real-world data? I think yes. By the end of this series, we’ll have hard numbers—not opinions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The raw numbers
&lt;/h2&gt;

&lt;p&gt;Every work session will be logged in a public CSV in the repo:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
csv
date,llm,model,module,tokens_in,tokens_out,cost_usd,duration_min,loc_python,loc_rust,hallucinations,tests_pass
---

Which LLM I used, tokens consumed, cost, duration, lines ported, hallucinations detected, tests passed. It’ll all be public. All verifiable.

At the end of this series, anyone will be able to sum up `cost_usd` and decide if RIIR was worth it. Anyone will be able to count hallucinations and decide if adversarial development works or is just hype. Spoiler: I have no idea what the numbers will be. And that’s what makes it interesting.

## Join me

- **Repo:** [github.com/frr149/rustyclaw](https://github.com/frr149/rustyclaw)—code, issues, tracking
- **Blog:** Each phase will have its own post here in the *RustyClaw: Rewrite It In Rust* series
- **Backlog:** Public on Linear, visible via GitHub issues

What’s better than an AGI? An AGI rewritten in Rust. Just ask the meme. Now let’s prove it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>rust</category>
      <category>python</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why 99% of What You Send to Claude Is Already Cached</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:16:48 +0000</pubDate>
      <link>https://dev.to/frr149/why-99-of-what-you-send-to-claude-is-already-cached-mb9</link>
      <guid>https://dev.to/frr149/why-99-of-what-you-send-to-claude-is-already-cached-mb9</guid>
      <description>&lt;p&gt;I'm building an app that monitors my token consumption in Claude Code. A few days ago, looking at the raw numbers, I found this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cacheReadInputTokens:     4,241,579,174
inputTokens:                  1,293,019
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four billion two hundred forty-one million tokens read from cache. One million three hundred thousand "fresh" tokens. That's a &lt;strong&gt;99.97%&lt;/strong&gt; cache hit rate.&lt;/p&gt;

&lt;p&gt;My first reaction was thinking something was broken. Nobody has a 99% cache hit rate. Not Redis. Not Cloudflare. Not your mom when she claims she already knows what you're going to ask for dinner.&lt;/p&gt;

&lt;p&gt;But it turns out it's not broken. This is exactly how it works. And the reason is as elegant as it is counterintuitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Gets Cached Isn't Text
&lt;/h2&gt;

&lt;p&gt;This is where most explanations fall short. When you read "prompt caching," you think of something like Redis: store the question, store the answer, if someone asks the same question, return the same answer.&lt;/p&gt;

&lt;p&gt;Not at all.&lt;/p&gt;

&lt;p&gt;What gets cached are &lt;strong&gt;KV tensors&lt;/strong&gt; — the Key and Value matrices that the transformer calculates during the prefill phase. In simpler terms: when an LLM receives your prompt, the first thing it does is convert all that text into internal numerical representations (embeddings) and multiply them by weight matrices to get the "keys" (K) and "values" (V) that the attention mechanism needs to generate the response.&lt;/p&gt;

&lt;p&gt;That calculation is &lt;strong&gt;expensive&lt;/strong&gt;. In a 200,000-token prompt (normal for Claude Code, where conversation history accumulates), we're talking about billions of matrix multiplication operations. It's the most GPU-intensive part, the slowest part, the most expensive part.&lt;/p&gt;

&lt;p&gt;The key insight: between one of your messages and the next, 99% of that prompt &lt;strong&gt;doesn't change&lt;/strong&gt;. The system prompt is identical. The previous conversation history is identical. The files it read are the same. The only new thing is your latest message.&lt;/p&gt;

&lt;p&gt;Why recalculate what you already calculated 30 seconds ago?&lt;/p&gt;
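
&lt;p&gt;A toy, single-layer illustration of why that reuse is safe: in a causal model, the K/V projections of a prefix don't depend on anything that comes after it. Shapes and numbers here are made up:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

d = 64                                    # toy model dimension
rng = np.random.default_rng(0)
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

history = rng.normal(size=(1000, d))      # embeddings already processed last turn
K_hist, V_hist = history @ Wk, history @ Wv   # this is what gets cached

new_msg = rng.normal(size=(5, d))         # only the new message needs prefill
K = np.vstack([K_hist, new_msg @ Wk])     # identical to recomputing from scratch,
V = np.vstack([V_hist, new_msg @ Wv])     # at roughly 0.5% of the matrix work
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;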

&lt;h2&gt;
  
  
  How Matching Works
&lt;/h2&gt;

&lt;p&gt;Caching isn't enough. You need to know when the cache is valid. Anthropic uses an elegant trick: &lt;strong&gt;cumulative prefix hashing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Each block of the prompt (system, tools, messages) generates a hash. But not an individual hash: a &lt;em&gt;cumulative&lt;/em&gt; hash. The hash of block 3 includes the content of blocks 1, 2, and 3. If anything changes in a previous block, the hash of all following blocks changes too.&lt;/p&gt;

&lt;p&gt;When a new request arrives, the system searches backwards from the point marked with &lt;code&gt;cache_control&lt;/code&gt;, comparing hashes block by block, until it finds the &lt;strong&gt;longest matching prefix&lt;/strong&gt;. Everything that matches → read from cache. Only the new stuff → gets calculated.&lt;/p&gt;

&lt;p&gt;It's like a movie you've seen 40 times. You don't need to watch the whole thing to know what happens. You only need to watch from the point where it differs from what you remember.&lt;/p&gt;

&lt;p&gt;The system only checks up to 20 blocks backwards. Beyond that, it stops searching. This is a practical decision to avoid spending more time searching the cache than calculating tensors directly.&lt;/p&gt;
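
&lt;p&gt;A minimal sketch of cumulative prefix hashing, with the bookkeeping simplified:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import hashlib

def prefix_hashes(blocks):
    # Each hash commits to everything before it: change block 1
    # and every later hash changes too.
    h, out = hashlib.sha256(), []
    for block in blocks:
        h.update(block.encode())
        out.append(h.copy().hexdigest())
    return out

cached = prefix_hashes(["system prompt", "tools", "msg 1", "reply 1"])
incoming = prefix_hashes(["system prompt", "tools", "msg 1", "reply 1", "msg 2"])

# Longest matching prefix = everything served from cache
hits = sum(1 for a, b in zip(cached, incoming) if a == b)
print(f"{hits} of {len(incoming)} blocks are cache hits")  # 4 of 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;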

&lt;h2&gt;
  
  
  Why Claude Code Has a 99% Cache Hit Rate
&lt;/h2&gt;

&lt;p&gt;Now that you know how matching works, the 99% stops being mysterious. Look at what happens in a typical Claude Code session:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message 1&lt;/strong&gt; (first in the session):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System prompt (8K tokens) + Tools (2K tokens) + Your message (500 tokens)
= 10,500 tokens → EVERYTHING calculated, EVERYTHING written to cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Message 2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System prompt (8K) + Tools (2K) + Message 1 (500) + Response 1 (3K) + Your message 2 (500)
= 14,000 tokens
→ First 10,500 → CACHE HIT (already calculated before)
→ The 3,500 new ones → calculated and added to cache
Cache hit: 75%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Message 10:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System prompt + Tools + 9 messages + 9 responses + Your message 10
= ~150,000 tokens
→ First ~149,500 → CACHE HIT
→ The ~500 new ones → calculated
Cache hit: 99.7%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See it? The conversation history &lt;strong&gt;only grows&lt;/strong&gt;. Each new message is a tiny fraction of the accumulated total. The cache ratio converges to 99% with the certainty of a natural logarithm.&lt;/p&gt;

&lt;p&gt;It's not magic. It's arithmetic: the numerator (new tokens per message) stays roughly constant, while the denominator (the accumulated context) grows with every exchange. The ratio can only climb toward 100%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Those Tensors Live
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Because caching KV tensors isn't like caching strings in Redis. We're talking about &lt;strong&gt;gigabytes of numerical data&lt;/strong&gt; that need to be available with microsecond latency.&lt;/p&gt;

&lt;p&gt;Anthropic uses a two-level system:&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1: VRAM (5-minute TTL)
&lt;/h3&gt;

&lt;p&gt;The tensors live directly in the &lt;strong&gt;GPU memory&lt;/strong&gt; that will serve the next request. Zero copy, zero network latency. Cache hits are nearly instantaneous because the data is already where it's needed.&lt;/p&gt;

&lt;p&gt;TTL: 5 minutes. If nobody makes a request in 5 minutes, they get evicted. This is the cache you use with the standard API. Cache write price: 1.25x normal input price.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 2: GPU Node SSD (1-hour TTL)
&lt;/h3&gt;

&lt;p&gt;If you pay for extended cache write (2x input price), tensors don't get evicted after 5 minutes. Instead, when they leave VRAM due to memory pressure, they get &lt;strong&gt;offloaded to the local SSD of the GPU node&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a cache hit comes in, they're reloaded from SSD to VRAM. Slower than level 1, but infinitely faster than recalculating tensors from scratch.&lt;/p&gt;

&lt;p&gt;The interesting part: &lt;strong&gt;no network involved&lt;/strong&gt;. It's not a remote Redis. It's not S3. It's an SSD physically attached to the server that has the GPU. The architecture is designed to minimize data movement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request → In VRAM? → Yes → Instant cache hit
                   → No → In local SSD? → Yes → Load to VRAM → Cache hit (~ms)
                                        → No → Calculate KV tensors → Cache miss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since February 2026, isolation is &lt;strong&gt;per workspace&lt;/strong&gt; (previously per organization). This means tensors from your development team don't mix with the marketing team's, even if they're in the same Anthropic organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;If you're evaluating whether this matters for your use case, here are the hard facts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cache read&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.1x&lt;/strong&gt; input price (90% discount)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache write 5 min&lt;/td&gt;
&lt;td&gt;1.25x input price&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache write 1 hour&lt;/td&gt;
&lt;td&gt;2x input price&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency reduction&lt;/td&gt;
&lt;td&gt;~85% on long prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minimum cacheable&lt;/td&gt;
&lt;td&gt;1,024 tokens per checkpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With Sonnet, input costs $3.00/M tokens. A cache read costs $0.30/M. In a Claude Code session with 200K tokens of history, the difference between recalculating and reading from cache is the difference between $0.60 and $0.06 &lt;strong&gt;per message&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Multiply that by the hundreds of messages you might exchange in a long session and you understand why Anthropic invested in building this: without prompt caching, long conversations with huge context would be economically unfeasible.&lt;/p&gt;
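
&lt;p&gt;The arithmetic, spelled out (prices from the table above; the session size is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;history_tokens = 200_000            # assumed long-session context
input_price    = 3.00 / 1_000_000   # $/token, Sonnet input
cache_read     = 0.30 / 1_000_000   # $/token, the 0.1x discount

print(f"without cache: ${history_tokens * input_price:.2f} per message")  # $0.60
print(f"with cache:    ${history_tokens * cache_read:.2f} per message")   # $0.06
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;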

&lt;h2&gt;
  
  
  My Real Data
&lt;/h2&gt;

&lt;p&gt;Back to my numbers from the beginning. In my Claude Code usage over a month:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cacheReadInputTokens:       4,241,579,174  (4.2 billion — read from cache)
cacheCreationInputTokens:     196,596,243  (197 million — written to cache)
inputTokens:                    1,293,019  (1.3 million — calculated without cache)
outputTokens:                   2,517,666  (2.5 million — generated by the model)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Global cache hit rate: 95.5%&lt;/strong&gt;. And within individual long sessions, it easily exceeds 99%.&lt;/p&gt;

&lt;p&gt;Notice the asymmetry: I've read 4.2 billion tokens from cache, but the model has only &lt;em&gt;generated&lt;/em&gt; 2.5 million tokens of output. The cache-read to actual-work ratio is &lt;strong&gt;1,685:1&lt;/strong&gt;. For every token the model produces, it reuses 1,685 tokens of previous context.&lt;/p&gt;

&lt;p&gt;This also means &lt;code&gt;cacheReadInputTokens&lt;/code&gt; &lt;strong&gt;isn't a good productivity metric&lt;/strong&gt;. It doesn't measure how much you've "used" the model. It measures how much history the model has &lt;em&gt;reread&lt;/em&gt;. It's like measuring your productivity by how many times you've opened the same file in your editor.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Anthropic Doesn't Tell You
&lt;/h2&gt;

&lt;p&gt;There are things that aren't public:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User→GPU affinity&lt;/strong&gt;: How do they ensure your next request lands on the same node that has your cache? Probably sticky routing per session, but they don't confirm it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSD type&lt;/strong&gt;: NVMe? CXL-attached? KV tensors for a 200K token prompt take up several GB. SSD speed matters a lot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PagedAttention&lt;/strong&gt;: vLLM (the most popular open-source serving engine) uses a technique called PagedAttention that manages KV tensors like virtual memory pages. Does Anthropic use something similar, or do they have something proprietary? Unknown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster topology&lt;/strong&gt;: How many GPUs, how they're interconnected, whether they use InfiniBand or Ethernet. Nothing public.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Analogy That Explains Everything
&lt;/h2&gt;

&lt;p&gt;Think of prompt caching as a surgeon's working memory during an operation.&lt;/p&gt;

&lt;p&gt;The surgeon (the model) has to process all the patient information (the prompt) to decide each move (the output). Without cache, they'd have to reread the complete medical history before each cut. With cache, they remember everything they already read and only need to process new information — the latest blood work, the tissue's response to the previous cut.&lt;/p&gt;

&lt;p&gt;What gets saved isn't the patient's documents (the text). It's the &lt;strong&gt;intermediate conclusions&lt;/strong&gt; the surgeon already extracted from those documents (the KV tensors). They don't need to reread the blood work. They already know what it says. They just need to integrate the new information with what they already know.&lt;/p&gt;

&lt;p&gt;The 99% cache hit rate simply reflects that, in a conversation with an LLM, the amount of "what we already know" grows much faster than the amount of "new stuff to process."&lt;/p&gt;

&lt;p&gt;And that's what makes it possible to have 200K token context conversations without each message costing you an arm and a leg.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; If you're interested in what happens when the app monitoring those tokens is based on data invented by the AI itself, read &lt;a href="https://dev.to/silent-failure-ai-invents-tests-say-fine/"&gt;Silent failure: when your AI makes things up and tests say everything's fine&lt;/a&gt;. And if you want to see how I manage API secrets without 1Password asking for Touch ID every 30 seconds, &lt;a href="https://dev.to/authorization-fatigue-1password-cache/"&gt;authorization fatigue and a 40-line cache&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>OpenAI scales PostgreSQL for 800 million users with a single writer (no sharding)</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:14:46 +0000</pubDate>
      <link>https://dev.to/frr149/openai-scales-postgresql-for-800-million-users-with-a-single-writer-no-sharding-3ld0</link>
      <guid>https://dev.to/frr149/openai-scales-postgresql-for-800-million-users-with-a-single-writer-no-sharding-3ld0</guid>
      <description>&lt;p&gt;Every time an article comes out about a large company's infrastructure, half the Hacker News comments are variations of "of course they use Kubernetes with 47 microservices and a distributed database with custom consensus protocol." And when it turns out they don't—that they use plain PostgreSQL with a single &lt;em&gt;primary&lt;/em&gt; and discipline—there's an uncomfortable silence.&lt;/p&gt;

&lt;p&gt;That just happened with OpenAI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers nobody expected
&lt;/h2&gt;

&lt;p&gt;Bohan Zhang, infrastructure engineer at OpenAI, published details about how they scale PostgreSQL for ChatGPT. The numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;800 million users&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A single PostgreSQL &lt;em&gt;primary&lt;/em&gt;&lt;/strong&gt; (writer) on Azure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~50 &lt;em&gt;read replicas&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Millions of queries per second&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;p99 of 10-19ms&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;99.999% availability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One SEV-0 in a year&lt;/strong&gt; (and that was from ImageGen's viral launch, which added 100 million new users in a week)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read that again. One. Single. Writer. For 800 million users.&lt;/p&gt;

&lt;h2&gt;
  
  
  "But they should shard"
&lt;/h2&gt;

&lt;p&gt;No. And the reason is brutally pragmatic.&lt;/p&gt;

&lt;p&gt;Sharding PostgreSQL would have required modifying &lt;strong&gt;hundreds of endpoints&lt;/strong&gt; in the application. Every query that assumes all data lives in the same database—which is practically all of them—would need to be rewritten to know which shard contains each piece of data.&lt;/p&gt;

&lt;p&gt;The cost of that migration? Months of engineering work, new bugs at every corner, and a transition period where you maintain both systems.&lt;/p&gt;

&lt;p&gt;What they did instead? They identified the heaviest &lt;em&gt;writes&lt;/em&gt; and moved them to Cosmos DB. Not because Cosmos is better than PostgreSQL, but because those specific &lt;em&gt;workloads&lt;/em&gt; fit better in a document model. The rest—the vast majority of business logic—stayed in PostgreSQL.&lt;/p&gt;

&lt;p&gt;Instead of complicating the entire system, they isolated the problem and solved it where it hurt. Surgery with a scalpel, not a chainsaw.&lt;/p&gt;

&lt;h2&gt;
  
  
  PgBouncer: from 50ms to 5ms per connection
&lt;/h2&gt;

&lt;p&gt;One of the first bottlenecks they found was connection establishment latency. PostgreSQL creates a process for each new connection. With thousands of simultaneous connections from hundreds of application pods, the connection &lt;em&gt;overhead&lt;/em&gt; consumed 50ms before executing a single query.&lt;/p&gt;

&lt;p&gt;The solution: PgBouncer as a &lt;em&gt;connection pooler&lt;/em&gt;. It maintains a pool of already-established connections and reuses them. Result: connection latency dropped to 5ms. 90% less, by changing a piece of plumbing.&lt;/p&gt;

&lt;p&gt;It's not new technology. PgBouncer has been in production at companies of all sizes for over 15 years. But there it is: a battle-tested, boring tool solving a problem in one of the most-used applications on the planet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ORM that did 12-table joins
&lt;/h2&gt;

&lt;p&gt;This is my favorite. Because I've seen it in my students' projects, in startups, in banks. Everywhere.&lt;/p&gt;

&lt;p&gt;The ORM generated queries with &lt;em&gt;joins&lt;/em&gt; across 12 tables. Not because someone designed it that way, but because the models were related to each other and the ORM, obediently, followed the relationships to the end.&lt;/p&gt;

&lt;p&gt;The solution wasn't changing ORMs or switching to manual queries for everything. It was &lt;strong&gt;moving logic to the application&lt;/strong&gt;. Instead of asking PostgreSQL to do a monstrous &lt;em&gt;join&lt;/em&gt;, they made several simpler queries and assembled the data in code.&lt;/p&gt;

&lt;p&gt;Is that less elegant? Yes. Is it faster? Enormously. Because PostgreSQL can optimize simple queries much better than a 12-table &lt;em&gt;join&lt;/em&gt; with cross conditions. And because you can cache partial results and reuse them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- BEFORE: the ORM generates this&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;profiles&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;teams&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="c1"&gt;-- 12 tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- AFTER: separate queries, logic in application&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;profiles&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- cacheable, parallelizable, debuggeable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each individual query is trivial. The &lt;em&gt;query planner&lt;/em&gt; executes them in microseconds. And if one fails or runs slow, you know exactly which one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The defenses nobody sees
&lt;/h2&gt;

&lt;p&gt;What I find brilliant about Bohan Zhang's article isn't the big numbers, but the small defenses that prevent everything from falling apart:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;If a transaction sits open doing nothing, PostgreSQL kills it after a configurable time. Why does this matter? Because an open transaction &lt;strong&gt;blocks &lt;em&gt;autovacuum&lt;/em&gt;&lt;/strong&gt;. And without &lt;em&gt;autovacuum&lt;/em&gt;, tables bloat, indexes degrade, and eventually your database gets slower every day.&lt;/p&gt;

&lt;p&gt;It's like leaving the fridge door open. Nothing happens for the first 5 minutes. But if you forget it all night, the next day everything is at room temperature.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema changes with 5-second timeout
&lt;/h3&gt;

&lt;p&gt;When you do an &lt;code&gt;ALTER TABLE&lt;/code&gt; in PostgreSQL, you need a &lt;em&gt;lock&lt;/em&gt; on the table. If there are long transactions running, that &lt;em&gt;lock&lt;/em&gt; waits. And while it waits, &lt;strong&gt;it blocks all new queries&lt;/strong&gt;. A schema migration that takes 200ms can bring down your database if there's an old transaction that won't finish.&lt;/p&gt;

&lt;p&gt;OpenAI's solution: &lt;code&gt;SET lock_timeout = '5s'&lt;/code&gt;. If the migration can't get the &lt;em&gt;lock&lt;/em&gt; in 5 seconds, it aborts. Better to fail fast and retry than block the entire system waiting.&lt;/p&gt;
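&lt;p&gt;A minimal sketch of what that looks like in a migration (table and column are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;
-- Abort if the lock isn't acquired within 5 seconds, instead of
-- queueing behind an old transaction and blocking every new query.
SET LOCAL lock_timeout = '5s';
ALTER TABLE users ADD COLUMN last_seen_at timestamptz;
COMMIT;
-- On failure: back off and retry later. Failing fast is the point.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;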

&lt;h3&gt;
  
  
  Rate limiting in 4 layers
&lt;/h3&gt;

&lt;p&gt;Not one. Not two. Four layers of &lt;em&gt;rate limiting&lt;/em&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Edge/CDN&lt;/strong&gt; — blocking abusive traffic before it reaches the application&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API gateway&lt;/strong&gt; — limits per user/API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application&lt;/strong&gt; — limits per operation type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt; — &lt;em&gt;connection limits&lt;/em&gt; and &lt;em&gt;statement timeouts&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each layer catches what the previous one lets through. Defense in depth. The same onion philosophy I apply for &lt;a href="https://dev.to/es/cinco-defensas-alucinaciones-codigo/"&gt;defenses against hallucinations&lt;/a&gt;, but for infrastructure.&lt;/p&gt;
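&lt;p&gt;Of the four, only the database layer is expressible in plain SQL. A hedged sketch (role name and numbers are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Layer 4: the database defends itself even if layers 1-3 fail.
ALTER ROLE api_user CONNECTION LIMIT 200;
ALTER ROLE api_user SET statement_timeout = '5s';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;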

&lt;h3&gt;
  
  
  Workload isolation by priority
&lt;/h3&gt;

&lt;p&gt;Not all queries are equal. A query for "show user's chat" is critical—if it fails, the user sees an error. A query for "generate analytics report" is important, but can wait 30 seconds.&lt;/p&gt;

&lt;p&gt;OpenAI routes queries by priority to different &lt;em&gt;read replicas&lt;/em&gt;. High-priority replicas have less load and respond faster. Low-priority ones can run hotter without affecting user experience.&lt;/p&gt;

&lt;p&gt;It's common sense, but requires discipline. You have to classify each query, configure routing, and resist the temptation to send everything to the fast replica "because it's just one more query."&lt;/p&gt;

&lt;h2&gt;
  
  
  Backfills that take weeks
&lt;/h2&gt;

&lt;p&gt;When you need to populate a new column for 800 million users, you can't do &lt;code&gt;UPDATE users SET new_column = computed_value&lt;/code&gt;. That would lock the table, saturate the disk, and probably bring down the &lt;em&gt;primary&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;At OpenAI, &lt;em&gt;backfills&lt;/em&gt; run with strict &lt;em&gt;rate limiting&lt;/em&gt;. Weeks. A backfill that takes weeks.&lt;/p&gt;

&lt;p&gt;Sound horrible? It's the opposite. It's the decision of a team that understands backfill speed is irrelevant compared to system stability. Better to take 3 weeks with nobody noticing than take 3 hours and have a SEV-0 at 2 AM.&lt;/p&gt;
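&lt;p&gt;The shape of such a backfill is old and boring: batch the &lt;code&gt;UPDATE&lt;/code&gt; and throttle between batches. A sketch, reusing the placeholder &lt;code&gt;computed_value&lt;/code&gt; from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One small batch: bounded lock time, bounded WAL, bounded replica lag.
UPDATE users
SET new_column = computed_value
WHERE id IN (
  SELECT id FROM users
  WHERE new_column IS NULL
  LIMIT 1000
);
-- A driver script sleeps between batches and repeats until 0 rows are
-- updated. Slow on purpose: weeks, and nobody notices.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;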

&lt;h2&gt;
  
  
  The cascading replication that's coming
&lt;/h2&gt;

&lt;p&gt;Currently they have ~50 replicas connected directly to the &lt;em&gt;primary&lt;/em&gt;. Each replica consumes a replication connection and bandwidth from the &lt;em&gt;primary&lt;/em&gt;. With 50 it's manageable. With 100+ it would be a problem.&lt;/p&gt;

&lt;p&gt;The solution they're developing: &lt;strong&gt;cascading replication&lt;/strong&gt;. Replicas that replicate from other replicas, not from the &lt;em&gt;primary&lt;/em&gt;. A tree instead of a star. The &lt;em&gt;primary&lt;/em&gt; sends data to 5-10 first-level replicas, and those replicas feed the rest.&lt;/p&gt;

&lt;p&gt;It's the same idea as BitTorrent. Instead of everyone downloading from the same server, nodes share with each other. Works for pirated movies, works for WAL segments.&lt;/p&gt;
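&lt;p&gt;Mechanically, PostgreSQL supports this out of the box: a standby can stream from another standby. A sketch of a second-tier replica's configuration (hostnames invented):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# postgresql.conf on a second-tier replica: point it at a
# first-tier replica instead of the primary (plus the usual standby.signal).
primary_conninfo = 'host=replica-tier1-03 port=5432 user=replicator'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;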

&lt;h2&gt;
  
  
  The lesson nobody wants to hear
&lt;/h2&gt;

&lt;p&gt;The industry has an addiction to &lt;em&gt;over-engineering&lt;/em&gt;. Every week a new database comes out promising to solve problems most companies don't have. And every week, engineering teams adopt those technologies because they "scale better" or "are more modern," without asking whether PostgreSQL with a bit of discipline would do the job.&lt;/p&gt;

&lt;p&gt;OpenAI—the company defining the future of AI, with one of the fastest-growing products in history—uses PostgreSQL. With a single &lt;em&gt;primary&lt;/em&gt;. No sharding. No exotic distributed database.&lt;/p&gt;

&lt;p&gt;They use PgBouncer (2007). Read replicas (concept from the 90s). &lt;em&gt;Connection pooling&lt;/em&gt; (as old as relational databases). &lt;em&gt;Rate limiting&lt;/em&gt; (invented before most of us were born).&lt;/p&gt;

&lt;p&gt;The magic isn't in the technology. It's in the discipline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple queries instead of monstrous &lt;em&gt;joins&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Aggressive timeouts instead of infinite waits&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Workload&lt;/em&gt; isolation instead of "everything on the same server"&lt;/li&gt;
&lt;li&gt;Migrate only what needs migrating, don't rewrite everything&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  For your next standup
&lt;/h2&gt;

&lt;p&gt;The next time someone on your team proposes migrating to a distributed database, or sharding PostgreSQL, or adding a queue service between the API and database "because it won't scale," show them these numbers.&lt;/p&gt;

&lt;p&gt;800 million users. One &lt;em&gt;primary&lt;/em&gt;. p99 of 10-19ms. 99.999% uptime.&lt;/p&gt;

&lt;p&gt;And ask: "Is our problem really that PostgreSQL doesn't scale? Or is it that our queries are a mess?"&lt;/p&gt;

&lt;p&gt;Because it's almost always the second one.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Source:&lt;/strong&gt; &lt;a href="https://blog.openai.com/" rel="noopener noreferrer"&gt;Inside the Postgres Setup Powering 800M ChatGPT Users&lt;/a&gt; — Bohan Zhang, OpenAI. If you read only one infrastructure article this year, make it this one.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>scalability</category>
      <category>infrastructure</category>
      <category>openai</category>
    </item>
    <item>
      <title>Madness Driven Design: Don Quixote, Sancho Panza, and Your AI Copilot</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:11:43 +0000</pubDate>
      <link>https://dev.to/frr149/madness-driven-design-don-quixote-sancho-panza-and-your-ai-copilot-fhd</link>
      <guid>https://dev.to/frr149/madness-driven-design-don-quixote-sancho-panza-and-your-ai-copilot-fhd</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: An LLM is like Don Quijote—you can't cure his madness, it's stochastic by nature. The solution isn't to fix the madman but to assign him a deterministic Sancho Panza as a sidekick. MDD consists of two layers: first, you study the errors it makes to design tools that absorb those mistakes, and then you let it loose with those tools to verify you've closed any gaps. Design for madness, not against it.&lt;/p&gt;




&lt;p&gt;I spent weeks auditing logs. 165 sessions of an AI agent interacting with a CLI to manage tasks. Over 500 errors. 370 retries. Patterns emerged, repeating over and over: the agent would use &lt;code&gt;--status&lt;/code&gt; when the flag was actually called &lt;code&gt;--state&lt;/code&gt;. It would write &lt;code&gt;Todo&lt;/code&gt; when the API expected &lt;code&gt;unstarted&lt;/code&gt;. It would pass &lt;code&gt;urgent&lt;/code&gt; as a priority when the system only accepted numbers.&lt;/p&gt;

&lt;p&gt;And what fascinated me was that every single error made sense. They weren't random. They were &lt;em&gt;plausible&lt;/em&gt;. Exactly the kind of mistakes you or I would make if we "kind of" understood a domain but had never read the documentation carefully.&lt;/p&gt;

&lt;p&gt;At some point during the audit, staring at yet another &lt;code&gt;--status Done&lt;/code&gt; that should have been &lt;code&gt;--state completed&lt;/code&gt;, I realized I was witnessing a literary pattern. One that is 400 years old.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don Quijote is an LLM
&lt;/h2&gt;

&lt;p&gt;Think about it for a minute. Don Quijote sees windmills and says, "Those are giants." He's not stupid—he's a well-read man, deeply familiar with tales of chivalry. His problem is that his model of the world has been contaminated with fictitious training data. He's read so many tales of knightly adventure that when he encounters something ambiguous, he interprets it according to his &lt;em&gt;training data&lt;/em&gt;: Windmills → giants. Flocks of sheep → armies. Inns → castles.&lt;/p&gt;

&lt;p&gt;An LLM does exactly the same thing. It has seen thousands of APIs during training. When you ask it to use one it doesn't know well, it doesn't say, "I don't know." It guesses. And it guesses well. Most of the time. Well enough that you'll trust it. And when it fails, the failure is &lt;em&gt;plausible&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--status&lt;/code&gt; instead of &lt;code&gt;--state&lt;/code&gt;. Because in 60% of the CLIs it has seen, the flag is called &lt;code&gt;--status&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Todo&lt;/code&gt; instead of &lt;code&gt;unstarted&lt;/code&gt;. Because in the GUI of the tool, the column is labeled "Todo." The LLM has seen screenshots in documentation. It's read blogs. It infers that if the UI says "Todo," the API must accept "Todo." Makes sense. But it's wrong.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;urgent&lt;/code&gt; instead of &lt;code&gt;1&lt;/code&gt;. Because in most priority systems, &lt;code&gt;urgent&lt;/code&gt; is a valid value. Who designs an API where priority is an integer from 1 to 4 instead of labeled options?&lt;/p&gt;

&lt;p&gt;Each hallucination is a reasonable inference based on incomplete data. Don Quijote isn't stupid. He's mad. And you can't cure madness.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Cervantes Already Knew
&lt;/h2&gt;

&lt;p&gt;Cervantes didn't try to cure Don Quijote. What he did was place Sancho Panza by his side.&lt;/p&gt;

&lt;p&gt;Sancho isn't brilliant. He hasn't read any books. He has no grand visions. But he is &lt;em&gt;deterministic&lt;/em&gt;. When Don Quijote says, "Look at those giants," Sancho replies, "Sir, they're windmills." Don Quijote doesn't always listen, but the information is there. The system has two layers: a stochastic one that generates hypotheses (Don Quijote) and a deterministic one that checks them against reality (Sancho).&lt;/p&gt;

&lt;p&gt;That's the architecture you need when working with an LLM. You're not going to stop it from hallucinating—it's in its nature. What you &lt;em&gt;can&lt;/em&gt; do is build deterministic filters to catch those hallucinations before they cause harm.&lt;/p&gt;

&lt;p&gt;And this is where the methodology comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  MDD: Madness Driven Design
&lt;/h2&gt;

&lt;p&gt;MDD has two layers, and the order matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: A Priori Archaeology
&lt;/h3&gt;

&lt;p&gt;Before you write a single line of code, you study the madness. You don’t guess—you observe. You gather real data on how the LLM interacts with existing tools and catalog its errors.&lt;/p&gt;

&lt;p&gt;In my case, I analyzed 165 sessions of an AI agent using a CLI to manage a software development team. The numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error Category&lt;/th&gt;
&lt;th&gt;Occurrences&lt;/th&gt;
&lt;th&gt;Retry Attempts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Invented or invalid flags&lt;/td&gt;
&lt;td&gt;275&lt;/td&gt;
&lt;td&gt;~150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broken JSON/GraphQL escaping&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;80+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Naming confusion&lt;/td&gt;
&lt;td&gt;40+&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Impossible CLI operations&lt;/td&gt;
&lt;td&gt;60+&lt;/td&gt;
&lt;td&gt;90+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verbose output wasting tokens&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Using that data, you design the new tool to &lt;em&gt;absorb&lt;/em&gt; the errors instead of rejecting them. In plain English: the sane adapts to the mad, not the other way around.&lt;/p&gt;

&lt;p&gt;Concrete examples of absorption:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM error               → Tool design
────────────────────────────────────────
--status Done           → --status is an alias for --state
                          Normalize "Done" to "completed"

--priority urgent       → Normalize "urgent" to 1
                          "high" → 2, "medium" → 3, "low" → 4

--no-pager              → Silently ignore flag
                          (the tool never uses a pager)

Broken quote escaping   → Require input via files or stdin
in descriptions           Never inline. Serde handles it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Each row in that table represents a design decision based on a real observed error. Not speculations about "what could go wrong," but logs showing "this wrong thing happened 40 times in 165 sessions."

The difference from conventional design is subtle but important. In normal design, you define the correct interface and reject anything that doesn't fit. In MDD, you define the correct interface _and_ all the likely incorrect interfaces your user will try, and you absorb them.

It's like designing a door that opens both by pushing and pulling. The "correct" door only opens in one direction. The _better_ door opens both ways because you've observed that 40% of people push instead of pulling.

### Layer 2: A Posteriori Verification

You build the tool with the defenses of Layer 1, and then you let it loose. You give the new tool to the LLM and watch what _new_ mistakes it makes.

If Layer 1 was thorough, the new mistakes should be minimal. If new errors appear, you've found gaps in your design. Every new error is an involuntary penetration test.

When I did this with my CLI, the LLM invented things I hadn't seen in the original audit:

- **A sorting enum that didn't exist.** The API allows sorting by `createdAt` and `updatedAt`. The LLM invented a `priority` sorting value. Perfectly logical—why _couldn’t_ you sort by priority? But it doesn't exist in the GraphQL schema.

- **A filtering operator that didn't exist.** To filter by state, the API accepts `state.type.in`. The LLM generated `state.id.or`. Coherent syntax, reasonable pattern, completely fabricated.

- **A file-locking function from another language.** In a Rust project, the LLM suggested `fcntl.flock` for file locking. That's a Python function. In Rust, you'd use the `fs2` crate.

Each of these errors was plausible. None were stupid. And each revealed a gap: the tool didn't validate the sorting enum, didn't reject fake filter operators, and the documentation for the file-locking crate wasn't included in the agent's context.

Layer 2 closes the loop. You don't assume your design is correct—you verify it by unleashing your most creative error-prone tester (the LLM).

## The Sancho Panza Stack

The Don Quijote-Sancho Panza metaphor isn’t just a cute comparison. It’s an architecture. In practice, "Sancho Panza" isn't a single entity—it's a _stack_ of deterministic layers, each one catching a different type of madness:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────┐
│         LLM (Don Quijote)            │  Generates plausible commands
│         Stochastic, creative         │  but potentially incorrect
└──────────────┬───────────────────────┘
               │ "--status Done --priority urgent"
┌──────────────▼───────────────────────┐
│  1. CLI Parser (clap)                │  Rejects flags that don’t exist
│     Accepts aliases: --status→--state│
└──────────────┬───────────────────────┘
               │ "--state Done --priority urgent"
┌──────────────▼───────────────────────┐
│  2. Normalization                    │  Normalize "Done"→"completed",
│     state and priority aliases       │  "urgent"→1
└──────────────┬───────────────────────┘
               │ "--state completed --priority 1"
┌──────────────▼───────────────────────┐
│  3. Validation                       │  Check if "completed" is a valid
│     Against known enums              │  state, if "1" is in range
└──────────────┬───────────────────────┘
               │ state=completed, priority=1
┌──────────────▼───────────────────────┐
│  4. Serialization (serde)            │  Escapes inputs correctly
│     GraphQL variables, no strings    │
│     interpolated                     │
└──────────────┬───────────────────────┘
               │ {"state":"completed","priority":1}
┌──────────────▼───────────────────────┐
│  5. API + Error Handling             │  If the API rejects something,
│     Retry with backoff, actionable   │  returns useful errors
│     messages                         │
└──────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
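&lt;p&gt;To make layers 2 and 3 concrete, here is a minimal Rust sketch. Function names and error messages are illustrative, not the tool's actual code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Illustrative sketch of layers 2-3; not the real tool's source.
// Layer 2: absorb plausible-but-wrong values into canonical ones.
fn normalize_priority(raw: &amp;amp;str) -&amp;gt; Result&amp;lt;u8, String&amp;gt; {
    match raw.to_ascii_lowercase().as_str() {
        "urgent" | "1" =&amp;gt; Ok(1),
        "high" | "2" =&amp;gt; Ok(2),
        "medium" | "3" =&amp;gt; Ok(3),
        "low" | "4" =&amp;gt; Ok(4),
        // Layer 3: anything outside the known enum is rejected with an
        // actionable message, so the agent can self-correct on retry.
        other =&amp;gt; Err(format!(
            "unknown priority '{other}': use 1-4 or urgent/high/medium/low"
        )),
    }
}

fn main() {
    assert_eq!(normalize_priority("urgent"), Ok(1)); // absorbed, not rejected
    assert!(normalize_priority("mega").is_err());    // validated, not guessed
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The point isn't the code; it's that both the absorption and the rejection are deterministic.&lt;/p&gt;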

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;
Five layers. Each one deterministic. Each one designed to catch a specific class of errors the LLM is guaranteed to make. The LLM doesn’t need to be right—it just needs to be _approximately_ right, and the stack takes care of the rest.

It’s like a purification funnel. Dirty water (stochastic LLM input) goes in at the top, and clean water (valid GraphQL queries) comes out the bottom. Each layer filters a specific impurity. No single layer is sufficient. All of them together are.

&lt;span class="gu"&gt;## MDD vs. Fuzz Testing: The Key Difference&lt;/span&gt;

If you’re familiar with fuzz testing, you might think "this is the same thing." It’s not.

|                            | Fuzz Testing              | MDD                                   |
| -------------------------- | ------------------------- | ------------------------------------- |
| &lt;span class="gs"&gt;**Input**&lt;/span&gt;                  | Random, malformed         | Plausible, coherent, well-written     |
| &lt;span class="gs"&gt;**Goal**&lt;/span&gt;                   | Find crashes, segfaults   | Find semantic errors                   |
| &lt;span class="gs"&gt;**Does input look valid?**&lt;/span&gt; | No                        | Yes—that's the problem                |
| &lt;span class="gs"&gt;**Example**&lt;/span&gt;                | &lt;span class="sb"&gt;`\x00\xff\xfe`&lt;/span&gt; as a name  | &lt;span class="sb"&gt;`--priority urgent`&lt;/span&gt; as a flag         |

A fuzzer generates garbage and sees if your program crashes. MDD generates input that _looks_ correct but is factually wrong. &lt;span class="sb"&gt;`--priority urgent`&lt;/span&gt; isn’t garbage—it’s exactly what a human, familiar with the domain but not the API, would write. A fuzzer would never generate that because it’s too coherent.

The same applies to mutation testing and chaos engineering. They mutate your code or break your infrastructure to see if your tests catch it. MDD doesn’t break anything—it generates input that is _correct according to another worldview_. It’s the difference between a brute-force attack and a social engineering attack. One tries every combination; the other convinces you to open the door.

&lt;span class="gu"&gt;## The Actionable Takeaway&lt;/span&gt;

You don’t need to build a CLI in Rust to apply MDD. The pattern works with any tool an LLM might use:

&lt;span class="gs"&gt;**Step 1: Observe the madness.**&lt;/span&gt; Before designing (or redesigning) a tool, make the LLM use the current version and log every error. Not 5 sessions—50. Patterns emerge with volume.

&lt;span class="gs"&gt;**Step 2: Categorize errors.**&lt;/span&gt; Are they nomenclature issues? Formatting errors? Semantic misunderstandings? Each category requires a different type of defense.

&lt;span class="gs"&gt;**Step 3: Design to absorb.**&lt;/span&gt; Don’t reject &lt;span class="sb"&gt;`--status`&lt;/span&gt; with a cryptic error. Accept &lt;span class="sb"&gt;`--status`&lt;/span&gt; as an alias for &lt;span class="sb"&gt;`--state`&lt;/span&gt;. Don’t reject &lt;span class="sb"&gt;`urgent`&lt;/span&gt; as a priority. Normalize it to &lt;span class="sb"&gt;`1`&lt;/span&gt;. The user you’ll most often have is an agent that knows 80% of the domain. Design for that 80%.

&lt;span class="gs"&gt;**Step 4: Release and verify.**&lt;/span&gt; Hand the new tool to the LLM without special instructions. Every new error is a gap in Layer 1. Patch it and iterate.

If humans and LLMs are both going to use your tool, MDD defenses improve the experience for everyone. Because humans make the same mistakes as LLMs—just fewer of them and with more embarrassment.

&lt;span class="gu"&gt;## The Architect Designs the Sancho&lt;/span&gt;

There’s a common misconception I want to clear up. The LLM doesn’t design the Sancho Panza Stack. The LLM is Don Quijote. You are Cervantes.

You’re the one observing the madness patterns. You’re the one deciding what to normalize and reject. You’re the one building the deterministic layers. The LLM can help implement them—it’s great at cranking out code—but the design decisions are yours.

It’s the difference between "I asked my AI to fix its own mistakes" (doesn’t work—it will repeat them) and "I observed my AI’s mistakes and built a system to absorb them" (works—the system is deterministic).

No way should you trust the LLM to self-correct. Its stochastic nature makes it certain to repeat the same errors with creative variations. What you need isn’t a better LLM—it’s a better Sancho.

&lt;span class="gu"&gt;## What Really Matters&lt;/span&gt;

MDD isn’t a testing methodology. It’s a _tool design methodology_. The question isn’t "How do I detect when the LLM is wrong?" but "How do I design so that being wrong has no consequences?"

It’s the same philosophy as guardrails on a mountain road. You don’t prevent bad turns—you put up a barrier so bad turns don’t kill you. You don’t fix the driver—you make the road safer.

Cervantes understood this four centuries ago. He didn’t try to cure Don Quijote. He gave him Sancho Panza and let the story work.

Your CLI, your API, your SDK—whatever your LLM is going to touch—needs its own Sancho. Deterministic, stubborn, incapable of hallucination. Not brilliant. Not creative. Just correct.

Design for madness. The sane adapt to the mad.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>llm</category>
      <category>rust</category>
      <category>cli</category>
    </item>
    <item>
      <title>My AI Read a JSON File from Disk 900 Times in a Loop (And Why No Linter Can Save You)</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:09:41 +0000</pubDate>
      <link>https://dev.to/frr149/my-ai-read-a-json-file-from-disk-900-times-in-a-loop-and-why-no-linter-can-save-you-21eg</link>
      <guid>https://dev.to/frr149/my-ai-read-a-json-file-from-disk-900-times-in-a-loop-and-why-no-linter-can-save-you-21eg</guid>
      <description>&lt;p&gt;Last week my AI wrote code that read a JSON file from disk, parsed it, did &lt;strong&gt;one&lt;/strong&gt; lookup, and repeated this 900 times inside a &lt;code&gt;for&lt;/code&gt; loop. Each iteration: open file, decode JSON, look up a value, throw it all away. Start over.&lt;/p&gt;

&lt;p&gt;It's a mistake I teach my students not to make within their first month of programming.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened (straight to the point)
&lt;/h2&gt;

&lt;p&gt;I'm building Tokamak, a macOS menu bar app that monitors Claude Max quota. Part of the functionality scans ~900 JSONL files from Claude Code sessions. For each file, it needs to know the &lt;em&gt;byte offset&lt;/em&gt; where it left off last time (incremental reading — only process what's new).&lt;/p&gt;

&lt;p&gt;The offsets are stored in a JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"offsets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"project-a/session-1.jsonl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;48231&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"project-b/session-2.jsonl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12044&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;Dictionary&amp;lt;String, UInt64&amp;gt;&lt;/code&gt;. 900 entries. ~55KB. Nothing fancy.&lt;/p&gt;

&lt;p&gt;And here's the detail that makes it even more absurd: &lt;strong&gt;the app itself created this file&lt;/strong&gt;. It's not JSON from an external API. It doesn't come from Claude Code. It's an internal state file that Tokamak writes and reads to track where it left off reading each session. The AI was reading from disk, 900 times, a file that the app itself had generated.&lt;/p&gt;

&lt;p&gt;"Why not use Core Data or SQLite, since you already have them in the app?" Good question. Because this file is a &lt;strong&gt;disposable progress cache&lt;/strong&gt;. If it gets corrupted, you delete it and the next scan rebuilds all offsets by reading the entire files once. Zero data loss. Plus: I can &lt;code&gt;cat session-offsets.json | jq .&lt;/code&gt; to debug (with Core Data I need &lt;code&gt;sqlite3&lt;/code&gt; and the sandbox path), it's &lt;code&gt;Sendable&lt;/code&gt; without the background context dance, and if Core Data's SQLite gets corrupted it doesn't take down the offsets (or vice versa). For 55KB of a flat dictionary, the ceremony of an entity with schema migration isn't justified.&lt;/p&gt;

&lt;p&gt;The format wasn't the problem. The access was.&lt;/p&gt;

&lt;p&gt;The code the AI wrote for the scan loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  &lt;span class="c1"&gt;// 900 files&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;storedOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// ↑ THIS reads and parses the JSON from disk. Every. Time.&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fileSize&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;storedOffset&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// ... read file, update offset ...&lt;/span&gt;
    &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setOffset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newOffset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// ↑ And THIS reads it AGAIN, modifies, and saves it.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two disk calls per iteration. 900 iterations. &lt;strong&gt;1,800 I/O operations&lt;/strong&gt; where there should have been exactly &lt;strong&gt;2&lt;/strong&gt;: one read at the start, one write at the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers (xctrace doesn't lie)
&lt;/h2&gt;

&lt;p&gt;I caught it with &lt;em&gt;Instruments&lt;/em&gt; (Time Profiler). The data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total samples&lt;/td&gt;
&lt;td&gt;7,260&lt;/td&gt;
&lt;td&gt;489&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Samples in &lt;code&gt;OffsetStore.load()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;1,704 (88%)&lt;/td&gt;
&lt;td&gt;10 (2%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scan time&lt;/td&gt;
&lt;td&gt;&amp;gt;20s&lt;/td&gt;
&lt;td&gt;&amp;lt;0.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;td&gt;~1.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;88% of scan time was reading and parsing a 900-line JSON. Over and over. Like Sisyphus pushing his boulder, but with &lt;code&gt;JSONDecoder&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix (that should make you cringe)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BEFORE: I/O on every iteration&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// reads JSON&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setOffset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newOffset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// reads + writes JSON&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// AFTER: load once, operate in memory, save once&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;offsets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;// ONCE&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;// O(1) in memory&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newOffset&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// ONCE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data structure didn't change. It was still a &lt;code&gt;Dictionary&amp;lt;String, UInt64&amp;gt;&lt;/code&gt;. The &lt;em&gt;hash table&lt;/em&gt; was already optimal. What was suboptimal was &lt;strong&gt;rebuilding it from disk on every iteration&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What doesn't work: adding "don't do this" to your CLAUDE.md
&lt;/h2&gt;

&lt;p&gt;After the fix, I added this to the project's &lt;code&gt;CLAUDE.md&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"NEVER do I/O (disk, network, decode JSON, Core Data fetch) inside a loop if it can be done before. Load data once before the loop, operate in memory, save once after."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And here's what I really want to tell you: &lt;strong&gt;it didn't help at all&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Weeks later, when adding a second service (Codex), the AI generated exactly the same pattern. With the instruction right there. It's like putting up a "keep off the grass" sign and expecting it to work.&lt;/p&gt;

&lt;p&gt;Why? Because the LLM doesn't &lt;em&gt;understand&lt;/em&gt; the rule. It has &lt;em&gt;seen&lt;/em&gt; it. Statistically, most code it read during training does one-off I/O, not I/O inside 900-iteration loops. The &lt;code&gt;load → use → save&lt;/code&gt; pattern inside a function is the most likely completion. That the function gets called from a 900-iteration &lt;code&gt;for&lt;/code&gt; loop is a contextual detail the model has no incentive to track.&lt;/p&gt;

&lt;h2&gt;
  
  
  What also doesn't work: linters
&lt;/h2&gt;

&lt;p&gt;No linter can catch this. Not SwiftLint, not ESLint, not Ruff, not Clippy. Think about it: the code is &lt;strong&gt;syntactically correct and semantically valid&lt;/strong&gt;. Each individual call to &lt;code&gt;offsetStore.offset(for:)&lt;/code&gt; is perfectly reasonable. The problem isn't in any single line — it's in the &lt;strong&gt;composition&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Looking at the layers of code meaning (an idea I use in my adversarial development course):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Fails here?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Signal&lt;/td&gt;
&lt;td&gt;Is this code?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Language&lt;/td&gt;
&lt;td&gt;Is it valid Swift?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Syntax&lt;/td&gt;
&lt;td&gt;Does it compile?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Local semantics&lt;/td&gt;
&lt;td&gt;Does the function do what it promises?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. System semantics&lt;/td&gt;
&lt;td&gt;Does it respect contracts and performance?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Architecture&lt;/td&gt;
&lt;td&gt;Does it scale without degrading?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The failure is in layers 5-6. Exactly where LLMs fail today in 2026. The syntax and local logic are impeccable. The problem is &lt;em&gt;emergent&lt;/em&gt;: it appears when a correct function gets used in a context that turns it into a bottleneck.&lt;/p&gt;

&lt;p&gt;A linter operates in layers 2-4. &lt;strong&gt;It has no visibility into composition or performance.&lt;/strong&gt; It's like asking Word's spell checker to detect a logical fallacy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The only thing that works: performance tests after the fact
&lt;/h2&gt;

&lt;p&gt;After the first fix, I wrote this test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@Test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Scan performance does not degrade with file count"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;scanPerformanceDoesNotDegradeWithFileCount&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;throws&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Create 1000 JSONL files with minimal content&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"..."&lt;/span&gt; &lt;span class="c1"&gt;// one valid line&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendingPathComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"session-&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;.jsonl"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// Pre-populate offset store (simulate re-scan)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;offsets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SessionOffsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;OffsetData&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"session-&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;.jsonl"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;ContinuousClock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;ContinuousClock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;

    &lt;span class="cp"&gt;#expect(elapsed &amp;lt; .seconds(3))  // &amp;lt;3s for 1000 files&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a brutally simple regression test. 1000 files, under 3 seconds, or the test fails. If anyone (human or AI) puts I/O back inside the loop, the test goes from taking 0.2 seconds to taking 30, and explodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And this is exactly what happened.&lt;/strong&gt; When the AI generated the second service with the same bug, the first service's performance test kept passing (it was a different service). But when I wrote the equivalent test for the new service, it failed immediately. The test did its job: catch the regression that neither the &lt;code&gt;CLAUDE.md&lt;/code&gt; nor any linter could see.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this confirms
&lt;/h2&gt;

&lt;p&gt;This bug is the perfect demonstration of the central thesis of what I call &lt;strong&gt;adversarial development&lt;/strong&gt;: &lt;em&gt;never trust, always verify&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;You can't trust that AI won't make freshman-level mistakes. It will. Repeatedly. Even when you tell it not to.&lt;/p&gt;

&lt;p&gt;You can't trust that linters will catch it. They can't. The error is above their abstraction level.&lt;/p&gt;

&lt;p&gt;What you can do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Performance tests&lt;/strong&gt; as an after-the-fact safety net&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real profiling&lt;/strong&gt; (xctrace, Instruments) to measure, not guess&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Defense in depth&lt;/strong&gt;: multiple layers, because no single layer covers everything&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The defense isn't a wall. It's an onion. Layers upon layers. And when one fails, the next one catches it.&lt;/p&gt;

&lt;h2&gt;
  
  
  For the skeptics
&lt;/h2&gt;

&lt;p&gt;"But Fernando, wouldn't a human programmer make the same mistake?"&lt;/p&gt;

&lt;p&gt;A junior, yes. A senior, probably not — because they have the pattern internalized. But even a senior would do &lt;em&gt;code review&lt;/em&gt; and catch it. The problem with AI-generated code is &lt;strong&gt;volume&lt;/strong&gt;: 50 files in 10 minutes. Nobody reviews 50 files line by line. Discriminator fatigue is real.&lt;/p&gt;

&lt;p&gt;And that's why you need verification to be automatic, not human. The performance test doesn't get tired. It doesn't get distracted. It has no fatigue. It runs every time you do &lt;code&gt;make test&lt;/code&gt; and tells you if something smells wrong.&lt;/p&gt;

&lt;p&gt;It's the same principle I apply in &lt;a href="https://dev.to/es/cinco-defensas-alucinaciones-codigo/"&gt;the 5 defenses against hallucinations&lt;/a&gt;: the verification system must be external to the generator. If the AI writes the code, verification has to come from somewhere else. In this case, from a clock that measures how long it takes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>swift</category>
    </item>
    <item>
      <title>Linear Agent Isn’t What You Need. Your Agent Was Already in the Terminal</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:06:38 +0000</pubDate>
      <link>https://dev.to/frr149/linear-agent-isnt-what-you-need-your-agent-was-already-in-the-terminal-45pk</link>
      <guid>https://dev.to/frr149/linear-agent-isnt-what-you-need-your-agent-was-already-in-the-terminal-45pk</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Linear just launched an integrated AI agent. Cool, but it doesn’t address the problem developers face when working with &lt;em&gt;coding agents&lt;/em&gt; in the terminal. What we actually need isn’t another AI agent but a rock-solid CLI that our existing agents can use seamlessly. And if we’re going to build one, it should be in Rust — which is why &lt;a href="https://github.com/frr149/lql" rel="noopener noreferrer"&gt;lql&lt;/a&gt; exists: a CLI for Linear, purpose-built for agents.&lt;/p&gt;




&lt;p&gt;Yesterday, Linear &lt;a href="https://linear.app/changelog/2026-03-24-introducing-linear-agent" rel="noopener noreferrer"&gt;launched their AI agent&lt;/a&gt;. It’s an integrated chatbot that gets your &lt;em&gt;roadmap&lt;/em&gt;, your issues, and even your code. You can chat with it on Slack, mention it in a comment, and it’ll synthesize context, suggest actions, and even create issues for you.&lt;/p&gt;

&lt;p&gt;Sounds awesome. Seriously, it sounds great.&lt;/p&gt;

&lt;p&gt;And yet, when I read the announcement, the first thought that crossed my mind was: “This is not what I needed.”&lt;/p&gt;

&lt;h2&gt;
  
  
  The Linear Saga
&lt;/h2&gt;

&lt;p&gt;To understand why I’m saying that, some context might help. My relationship with Linear has been a love-hate story worthy of a daytime soap opera.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act I: The MCP.&lt;/strong&gt; Linear had this MCP server for AI agents to interact with. It worked like a lighter in a hurricane: technically it could light up, but the flame wouldn’t last more than two seconds. It was janky, slow, and had a special talent for failing right when you needed it the most. I uninstalled it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act II: The GraphQL API.&lt;/strong&gt; The alternative was interacting directly with Linear via GraphQL. And, yes, it worked. Until the moment you had to input special characters in an issue description, and dealing with escaping made you question your life choices. There was this one time I spent more time figuring out how to escape a parenthesis than writing the actual code the issue described.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act III: The Linear CLI.&lt;/strong&gt; Enter &lt;a href="https://github.com/schpet/linear-cli" rel="noopener noreferrer"&gt;&lt;code&gt;linear&lt;/code&gt; CLI&lt;/a&gt;, a community-driven project. &lt;code&gt;brew install schpet/tap/linear&lt;/code&gt; and off you go. It was a humble tool, no frills, but it did exactly what I needed: create, list, and update issues from the terminal without wrestling GraphQL or ghost MCPs. No pop-ups, no surprises.&lt;/p&gt;

&lt;p&gt;&lt;a href="//{{&amp;lt;%20relref%20"&gt;}}"&amp;gt;In a previous post,&lt;/a&gt; I wrote about retiring other tools in favor of this CLI. I managed to create 49 issues in under one minute with a bash script. With MCP, it would’ve taken me an hour and a half.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter the Agent
&lt;/h2&gt;

&lt;p&gt;And now Linear rolls out their new AI agent. The pitch: an integrated assistant that understands your workspace, connects with your codebase, and automates workflows.&lt;/p&gt;

&lt;p&gt;Check this out: &lt;strong&gt;you know what the agent &lt;em&gt;doesn’t&lt;/em&gt; do?&lt;/strong&gt; Work via the terminal. It’s not a tool for &lt;em&gt;your&lt;/em&gt; AI agent. It’s a Linear AI agent that lives entirely within Linear.&lt;/p&gt;

&lt;p&gt;If you’re working with Claude Code, Codex, or any &lt;em&gt;coding agent&lt;/em&gt; in the terminal, Linear’s agent isn’t helpful to you at all. Your agent can’t invoke Linear’s agent to create an issue. It’s not composable. It’s not a Lego piece that plugs into your workflow. It’s a closed product within a closed product.&lt;/p&gt;

&lt;p&gt;That is to say: Linear built an agent for product managers working inside the Linear app — not for developers working in the terminal with AI agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Already Had Your Agent
&lt;/h2&gt;

&lt;p&gt;Here’s the epiphany I had while reading that announcement: I already &lt;em&gt;have&lt;/em&gt; an agent for Linear. It’s called Claude Code.&lt;/p&gt;

&lt;p&gt;I don’t need Linear to put a chatbot inside their app for me. What I need is for Linear’s &lt;em&gt;programmable interface&lt;/em&gt; to not be a hack job. To simply ensure that when I tell my agent, “Create an issue with these details,” it just works. Every time, hassle-free.&lt;/p&gt;

&lt;p&gt;And that’s precisely what a good CLI is supposed to do. My agent — Claude Code — already knows how to use the terminal. It already knows how to execute commands. It already knows how to parse &lt;em&gt;output&lt;/em&gt;. All it needs is a reliable tool on the other side.&lt;/p&gt;

&lt;p&gt;I tell Claude Code, “Create an issue in Linear with high priority,” and it executes a terminal command. It works. Next task. No chatbot, no fancy GUI, no Slack. One command, one result.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future is CLI (Surprisingly)
&lt;/h2&gt;

&lt;p&gt;Here’s a hot take: in a world where everyone is building AI agents with conversational interfaces inside their apps, the future for developers is, paradoxically, the &lt;em&gt;command-line interface&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Why? Because the CLI is the universal interface for agents. Your &lt;em&gt;coding agent&lt;/em&gt; can’t click buttons. It can’t navigate a web app. It can’t use a chatbot embedded in another app. But it &lt;em&gt;can&lt;/em&gt; execute a command and read its &lt;em&gt;output&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The CLI is the most democratic API out there. No SDKs, no 15 OAuth redirects, no MCP that breaks every other Tuesday. One binary, a few flags, &lt;em&gt;stdin&lt;/em&gt;/&lt;em&gt;stdout&lt;/em&gt;. Unix nailed it 50 years ago because it works.&lt;/p&gt;

&lt;p&gt;The real problem is that most SaaS tool CLIs are an afterthought. “Oh, you also need a CLI? Fine, let an intern slap a wrapper on our REST API.” And the result? Tools that spew unreadable JSON, lack autocomplete, fail silently, or require a token that expires every 37 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  500+ Errors No One Noticed
&lt;/h2&gt;

&lt;p&gt;But before talking about rewriting anything, I wanted data. Not gut feelings — actual data. So I did something only someone with an LLM and 1 million context tokens would think to do: I asked Claude Code to parse its own past sessions and identify &lt;em&gt;every time&lt;/em&gt; it failed while interacting with Linear.&lt;/p&gt;

&lt;p&gt;165 sessions. 11 projects. Months of history. And the results were... eye-opening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;500+ errors. 370+ retries.&lt;/strong&gt; A conservative estimate of 700,000 tokens wasted per month just battling Linear.&lt;/p&gt;

&lt;p&gt;The errors break down into categories that are downright cringeworthy when viewed together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The classic: &lt;code&gt;--sort&lt;/code&gt; forgotten.&lt;/strong&gt; Linear CLI requires &lt;code&gt;--sort priority&lt;/code&gt; on every &lt;code&gt;list&lt;/code&gt;. No default. Omitted it? Error. Claude forgot it &lt;strong&gt;40 times&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The translator: UI vs CLI states.&lt;/strong&gt; In Linear’s UI, states are labeled as "Todo," "In Progress," and "Done." But in the CLI, they’re &lt;code&gt;unstarted&lt;/code&gt;, &lt;code&gt;started&lt;/code&gt;, &lt;code&gt;completed&lt;/code&gt;. Claude used the UI names 12 times. &lt;code&gt;--state "Todo"&lt;/code&gt; → error. &lt;code&gt;--state "In Progress"&lt;/code&gt; → error. Same mistakes, over and over.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The optimist: flags that don’t exist.&lt;/strong&gt; &lt;code&gt;--status&lt;/code&gt; instead of &lt;code&gt;--state&lt;/code&gt; (11 times). &lt;code&gt;--priority urgent&lt;/code&gt; instead of &lt;code&gt;--priority 1&lt;/code&gt; (17 times). &lt;code&gt;--no-pager&lt;/code&gt; on unsupported commands (15 times). And the list goes on — all errors due to guesswork.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the cherry on top? &lt;strong&gt;171 calls to Linear’s MCP — which had already been uninstalled.&lt;/strong&gt; Across four projects. Even after I typed out: “Linear’s MCP is trash, use the API.”&lt;/p&gt;

&lt;h2&gt;
  
  
  How lql Addresses All of This
&lt;/h2&gt;

&lt;p&gt;It’s one thing to complain; it’s another to fix it. Each of the problems above has a concrete solution in lql:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;th&gt;Solution in lql&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--sort&lt;/code&gt; forgotten&lt;/td&gt;
&lt;td&gt;40+&lt;/td&gt;
&lt;td&gt;Default &lt;code&gt;priority&lt;/code&gt;. No arguments needed for &lt;code&gt;lql list&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI vs CLI state mismatch&lt;/td&gt;
&lt;td&gt;12+&lt;/td&gt;
&lt;td&gt;Automatic aliasing. &lt;code&gt;Todo&lt;/code&gt; → &lt;code&gt;unstarted&lt;/code&gt;, &lt;code&gt;Done&lt;/code&gt; → &lt;code&gt;completed&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--priority urgent&lt;/code&gt; mistake&lt;/td&gt;
&lt;td&gt;17+&lt;/td&gt;
&lt;td&gt;Automatic aliasing. &lt;code&gt;urgent&lt;/code&gt; → &lt;code&gt;1&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt; → &lt;code&gt;2&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--no-interactive&lt;/code&gt; absent&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;No interactive mode. Commands never hang.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broken JSON escaping&lt;/td&gt;
&lt;td&gt;25+ (80+ retries)&lt;/td&gt;
&lt;td&gt;Native GraphQL variables. No broken strings, only properly built JSON.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Everything boils down to reducing friction. A good CLI makes the correct usage the easy usage. Nothing more, nothing less.&lt;/p&gt;
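&lt;p&gt;As a sketch of what that absorption looks like at the parser level, here's illustrative clap code (not lql's actual source) covering three rows of the table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use clap::{Arg, ArgAction, Command};

// Illustrative sketch, not lql's real code: three of the audited error
// classes, absorbed at the parser level instead of rejected.
fn cli() -&amp;gt; Command {
    Command::new("lql").subcommand(
        Command::new("list")
            // --status (11 misuses) becomes a silent alias of --state.
            .arg(Arg::new("state").long("state").alias("status"))
            // A forgotten --sort (40+ misuses) falls back to a default.
            .arg(Arg::new("sort").long("sort").default_value("priority"))
            // --no-pager (15 misuses) is accepted and ignored: no pager exists.
            .arg(
                Arg::new("no-pager")
                    .long("no-pager")
                    .action(ArgAction::SetTrue)
                    .hide(true),
            ),
    )
}

fn main() {
    // The "wrong" flag parses cleanly instead of erroring out.
    let m = cli().get_matches_from(["lql", "list", "--status", "started"]);
    let list = m.subcommand_matches("list").unwrap();
    assert_eq!(list.get_one::&amp;lt;String&amp;gt;("state").map(String::as_str), Some("started"));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;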

&lt;h2&gt;
  
  
  If You’re Rewriting, Use Rust
&lt;/h2&gt;

&lt;p&gt;Here’s where the unexpected twist comes in (or expected, if you know me).&lt;/p&gt;

&lt;p&gt;If a CLI is the critical bridge between your agent and your &lt;em&gt;issue tracker&lt;/em&gt;, then it should be written with care. In a language that prevents you from shipping garbage. With proper error handling. With a static binary that doesn’t rely on Node or Python runtime environments.&lt;/p&gt;

&lt;p&gt;Let’s beat the dead horse here: if we’re rewriting anything, it should be in Rust.&lt;/p&gt;

&lt;p&gt;And the project name? &lt;strong&gt;lql&lt;/strong&gt; — &lt;em&gt;Linear Query Language&lt;/em&gt;. Like SQL, but for your &lt;em&gt;issue tracker&lt;/em&gt;. SQL is the language for querying databases; lql is the language for querying your backlog.&lt;/p&gt;

&lt;p&gt;Oh, and one last juicy note: Linear's official CLI? It’s &lt;strong&gt;157 MB&lt;/strong&gt; (bundled Node.js runtime). &lt;code&gt;lql&lt;/code&gt;? Just &lt;strong&gt;4.7 MB&lt;/strong&gt;. A static binary, 33 times smaller, with no JavaScript baggage.&lt;/p&gt;

&lt;p&gt;Ferris the crab approves. 🦀&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Series: Adversarial Programming&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next: &lt;a href="https://dev.to/en/adversarial-programming-ai-copilot-invents-api/"&gt;Adversarial programming: when your AI copilot invents APIs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Previous: &lt;a href="https://dev.to/en/wrong-path-impossible-not-forbidden/"&gt;The wrong path shouldn’t be forbidden, it should be impossible&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>linear</category>
      <category>cli</category>
      <category>rust</category>
      <category>ai</category>
    </item>
    <item>
      <title>Five Nonexistent Experts Review Your Startup Before You Build It</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:04:36 +0000</pubDate>
      <link>https://dev.to/frr149/five-nonexistent-experts-review-your-startup-before-you-build-it-4cem</link>
      <guid>https://dev.to/frr149/five-nonexistent-experts-review-your-startup-before-you-build-it-4cem</guid>
      <description>&lt;p&gt;In November 2024, a project named &lt;strong&gt;Freysa&lt;/strong&gt; assigned an LLM agent to guard an Ethereum wallet. The instruction was straightforward: under no circumstance should the funds be transferred. Participants paid increasing amounts for each attempt to convince it otherwise. After 481 attempts and $47,000 added to the pot, someone managed to trick the model into believing that the &lt;em&gt;reject&lt;/em&gt; function was actually the &lt;em&gt;transfer&lt;/em&gt; function.&lt;/p&gt;

&lt;p&gt;Weeks later, Jane Street published a puzzle involving a 2,500-layer neural network that turned out to be an MD5 implementation. The winner solved it by combining matrix visualization, reduction to SAT, cryptographic pattern recognition, and a query to ChatGPT.&lt;/p&gt;

&lt;p&gt;Both projects generated more buzz than most startups with million-dollar funding rounds. The obvious question is: how do you evaluate an idea like this &lt;em&gt;before&lt;/em&gt; you build it? How do you know if it has real viral potential or if it’s just an interesting technical exercise no one will share?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Evaluating MVPs in the Viral Era
&lt;/h2&gt;

&lt;p&gt;Most frameworks for evaluating product ideas assume a rational market. Business Model Canvas, Lean Canvas, Jobs To Be Done — these are all great tools for products with predictable demand. But they fail for projects where viral distribution &lt;em&gt;is&lt;/em&gt; the product.&lt;/p&gt;

&lt;p&gt;Freysa didn’t have "customers" in the traditional sense. It didn’t solve a "job to be done." Its mechanism relied on the act of participation itself generating attention, which attracted more participants. It was a circular economy: more attempts created a bigger pot, a bigger pot attracted media coverage, and media coverage brought in more attempts.&lt;/p&gt;

&lt;p&gt;To evaluate such projects, you need perspectives that generate &lt;strong&gt;tension&lt;/strong&gt;, not consensus. A business analyst will tell you there’s no sustainable revenue model. A viral expert will say sustainability doesn’t matter if the k-factor is greater than 1. Both are right. And the truth lies somewhere in the conflict, emerging only through that friction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea: An Adversarial Council of Simulated Experts
&lt;/h2&gt;

&lt;p&gt;I’ve designed a tool that simulates a council of five experts, each equipped with a specific decision-making framework and a defined jurisdiction. These aren’t just generic personalities with famous names. Each applies a set of precise decision filters that catch signals a generic analysis would miss.&lt;/p&gt;

&lt;p&gt;The process has three phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Independent Analysis:&lt;/strong&gt; Each expert evaluates the idea through their lens, without seeing the others' input. This prevents anchoring — if the business expert speaks first and says, "This is amazing," the legal expert might soften their objections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial Debate:&lt;/strong&gt; The experts review each other’s analyses and critique them. No politeness, just arguments based on merit. A maximum of 10 rounds are allowed to reach either consensus or deadlock.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesis:&lt;/strong&gt; The final output is an actionable plan with flagged issues by area, a timeline, and — most importantly — &lt;strong&gt;kill criteria&lt;/strong&gt;: specific metrics that, if unmet, mean the project should be abandoned.&lt;/li&gt;
&lt;/ol&gt;
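
&lt;p&gt;Phase 1 maps directly onto parallel, non-interactive runs. Here’s a minimal sketch using Claude Code’s &lt;code&gt;-p&lt;/code&gt; (print) mode; the prompt files and layout are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Phase 1: five independent analyses; none sees the others' output.
# prompts/&lt;expert&gt;.md holds each persona and its decision framework.
for expert in graham lessig godin balaji dhh; do
  cat "prompts/$expert.md" idea.md \
    | claude -p "Analyze this idea strictly through your framework" \
      &gt; "analysis-$expert.md" &amp;
done
wait  # all five must finish before the debate phase reads their files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;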

&lt;h2&gt;
  
  
  The Five Selected (and Why They Were Chosen)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Paul Graham — Business and Strategy
&lt;/h3&gt;

&lt;p&gt;His framework for evaluating zero-stage startups is the most rigorous for projects with no data. His question, "Are you doing something people want?" is brutal but necessary. "People" here isn’t a market — it’s a person with a name.&lt;/p&gt;

&lt;p&gt;What he brings to the council: discipline in distinguishing between "interesting idea" and "viable business." His mantra of "do things that don’t scale" is crucial for viral MVPs, where the temptation is to build infrastructure for a million users that don’t yet exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who Didn’t Make the Cut:&lt;/strong&gt; Peter Thiel (too contrarian — sometimes he dismisses good projects for not being sufficiently "zero to one"), Alex Hormozi (focused on service businesses, not tech products focused on virality).&lt;/p&gt;

&lt;h3&gt;
  
  
  Lawrence Lessig — Legal and Regulatory
&lt;/h3&gt;

&lt;p&gt;He’s not a lawyer who just says, "This isn’t possible." Instead, he views regulation as &lt;strong&gt;architecture&lt;/strong&gt;. His "four modalities of regulation" framework (law, social norms, market, and code/architecture) helps analyze how to design systems where regulation won’t be a bottleneck, instead of trying to dodge it.&lt;/p&gt;

&lt;p&gt;What he brings to the council: the question, "What happens when the regulator notices you?" Many crypto/AI projects are legally irrelevant at small scale but become regulated when large. Lessig identifies the threshold where regulation gets triggered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who Didn’t Make the Cut:&lt;/strong&gt; A generic corporate lawyer (they’d kill any project early with a barrage of "no's"). Lessig goes beyond the law, recognizing that system design can make legal intervention unnecessary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Seth Godin — Marketing and Positioning
&lt;/h3&gt;

&lt;p&gt;His core question — "Who is your &lt;em&gt;smallest viable audience&lt;/em&gt; and why do they care?" — is perhaps the most critical for a viral launch. He doesn’t think about "reaching millions"; he focuses on "reaching the first 100 people who truly care."&lt;/p&gt;

&lt;p&gt;What he brings to the council: the remarkability test. Is this something that someone will share without you asking? "Useful" doesn’t get shared. "Remarkable" does. His concept of "Tribes" perfectly aligns with tech/crypto communities that already have strong group identities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who Didn’t Make the Cut:&lt;/strong&gt; Philip Kotler (too corporate — thinks in terms of traditional multinational marketing), April Dunford (her positioning framework is incredible but geared towards repositioning existing products, not launching new ones).&lt;/p&gt;

&lt;h3&gt;
  
  
  Balaji Srinivasan — Hype and Virality
&lt;/h3&gt;

&lt;p&gt;The most aggressive adviser on the panel, Balaji has a native understanding of crypto-style distribution mechanics: FOMO, tokenized incentives, network effects, and how something goes from zero to trending within 48 hours.&lt;/p&gt;

&lt;p&gt;What he brings to the council: the question, "What makes someone screenshot this and post it on Twitter in the next five minutes?" This is the atomic unit of virality. If your product doesn’t inspire spontaneous screenshots, you’ll need a marketing budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who Didn’t Make the Cut:&lt;/strong&gt; GaryVee (understands attention but not the crypto+AI intersection where viral mechanisms thrive today), Mr. Beast (his expertise is video content virality, not tech products), Nir Eyal (his "Hooked" framework targets retention, not launch virality — separate problems).&lt;/p&gt;

&lt;h3&gt;
  
  
  DHH (David Heinemeier Hansson) — Technical
&lt;/h3&gt;

&lt;p&gt;His obsession is "the simplest thing that works." For an MVP, the greatest technical risk isn’t picking the wrong stack — it’s never launching because you spent three months choosing one.&lt;/p&gt;

&lt;p&gt;What he brings to the council: the question, "Can one person build this in two weeks?" If not, the scope is too large, or the stack is overly complicated. His rule of "boring technology" (PostgreSQL, not CockroachDB; Redis, not Dragonfly) counters "we’re using blockchain because we can" syndrome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who Didn’t Make the Cut:&lt;/strong&gt; Werner Vogels (focuses on scalability from day one, which isn’t needed for MVPs), Kelsey Hightower (deep Kubernetes expertise, which usually results in over-engineering an MVP — using a sledgehammer to crack a nut).&lt;/p&gt;

&lt;h2&gt;
  
  
  Productive Tensions: Where Truth Emerges
&lt;/h2&gt;

&lt;p&gt;The tensions between council members aren’t a flaw in the design. They &lt;em&gt;are&lt;/em&gt; the design.&lt;/p&gt;

&lt;h3&gt;
  
  
  Balaji vs. Lessig: Virality vs. Regulation
&lt;/h3&gt;

&lt;p&gt;This is the primary tension. Balaji will push for FOMO mechanics involving real money (visible prize pools, pay-to-play, tokens). Lessig will point out that in the EU, pay-to-play with accumulating prize pools qualifies as gambling and requires a gaming license.&lt;/p&gt;

&lt;p&gt;The productive resolution isn’t one side "winning." It’s a design that satisfies both — for example, free challenges with sponsored prize pools (legal in most jurisdictions) instead of direct entry fees (regulated as gambling in many countries).&lt;/p&gt;

&lt;h3&gt;
  
  
  Godin vs. DHH: Remarkable vs. Spartan
&lt;/h3&gt;

&lt;p&gt;Godin will want a memorable experience — a public leaderboard with animations, participant profiles, achievement badges. DHH will advocate for a static page with SQLite and a form.&lt;/p&gt;

&lt;p&gt;The resolution: Can you achieve remarkability with boring tech? The answer is almost always yes. The challenge itself is the remarkable element, not the interface. A leaderboard in an HTML table with no JavaScript can be more notable than a Three.js dashboard if the content displayed is genuinely impressive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Paul Graham vs. Balaji: Unit Economics vs. Growth
&lt;/h3&gt;

&lt;p&gt;PG will ask for a clear revenue model from day one. Balaji will argue that viral distribution &lt;em&gt;is&lt;/em&gt; the model — audience first, monetization later.&lt;/p&gt;

&lt;p&gt;Both have precedents to back them up. Instagram had no revenue model when it reached 100 million users. But for every Instagram, there are 10,000 projects that scaled without revenue and ultimately failed.&lt;/p&gt;

&lt;p&gt;The usual resolution is temporal: validate virality first (giving Balaji the win), but impose a strict timeline for demonstrating unit economics (giving PG the eventual win). The kill criteria formalize this agreement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Most Valuable Output: Kill Criteria
&lt;/h2&gt;

&lt;p&gt;Most side projects die slowly. There’s no clear moment when they fail. The founder just stops dedicating time because "other things came up." Three months later, the domain expires, and no one notices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kill criteria&lt;/strong&gt; are the opposite: concrete thresholds, with defined deadlines, that signal when to stop.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Deadline&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Beta participants&lt;/td&gt;
&lt;td&gt;&amp;lt;50 in 2 weeks&lt;/td&gt;
&lt;td&gt;Week 2&lt;/td&gt;
&lt;td&gt;Pivot or stop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Launch shares&lt;/td&gt;
&lt;td&gt;&amp;lt;100&lt;/td&gt;
&lt;td&gt;Week 4&lt;/td&gt;
&lt;td&gt;Reevaluate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retention rate&lt;/td&gt;
&lt;td&gt;&amp;lt;10% 30-day retention&lt;/td&gt;
&lt;td&gt;Week 8&lt;/td&gt;
&lt;td&gt;Stop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rule: if two of the three kill criteria fire, the project halts. No exceptions. No "one more month." No "we didn’t do enough marketing."&lt;/p&gt;

&lt;p&gt;This is what separates a professional from an amateur. Amateurs fall in love with the idea. Professionals fall in love with the outcome. And if the outcome doesn’t materialize within the agreed timeframe, they have the discipline to move on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Simulations, Not Real People?
&lt;/h2&gt;

&lt;p&gt;The obvious objection: Why not talk to real people instead of simulating experts with an LLM?&lt;/p&gt;

&lt;p&gt;Three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Availability.&lt;/strong&gt; Paul Graham isn’t giving you two hours to analyze your side project. The simulation will. And while the simulation doesn’t have the original’s accumulated experience, it applies their published frameworks with a consistency busy people might not achieve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honest Adversariality.&lt;/strong&gt; Real people soften their critiques out of politeness. A simulation configured to be adversarial will actually question everything. "You don’t have a functional revenue model" is something that an investor might think but not say out loud in a first meeting. The simulation says it in round one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero Marginal Cost.&lt;/strong&gt; You can run the council five times, tweaking variations of the same idea, and compare results. Trying to do that with real people would consume 25 hours of their time.&lt;/p&gt;

&lt;p&gt;Simulations don’t replace real advisors. But they prepare you for those conversations by eliminating obvious issues beforehand. It’s the difference between presenting a clean draft and showing up with an unfiltered first pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Pattern: Structured Debate as a Decision-Making Tool
&lt;/h2&gt;

&lt;p&gt;This design isn’t just for MVPs. I already use it for code reviews (three experts in simplicity, design, and performance) and design reviews (four experts in information density, usability, product, and interaction).&lt;/p&gt;

&lt;p&gt;The core pattern remains:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Experts with defined jurisdictions:&lt;/strong&gt; Each has domain-specific authority. Outside their domain, they have no vote.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit decision frameworks:&lt;/strong&gt; It’s not "what do you think," but "what does your framework say about this."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planned Tensions:&lt;/strong&gt; Conflicts between experts are intentional. They’re the most valuable source of insight in the process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forced Convergence:&lt;/strong&gt; Maximum of N rounds. If no consensus is reached, the moderator decides and documents dissent as a risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actionable Output:&lt;/strong&gt; Not an essay but specific issues, deadlines, and success/failure criteria.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The difference between "asking one LLM to analyze your idea" and "having five specialized LLMs debate your idea" is not one of degree. It’s one of kind. The former produces an opinion. The latter produces a risk map and plan, exposing blind spots as the perspectives clash.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question You Should Be Asking Yourself
&lt;/h2&gt;

&lt;p&gt;Before you write the first line of code for your next project, ask yourself: Who’s going to tell you it’s a bad idea?&lt;/p&gt;

&lt;p&gt;If the answer is "no one, because I haven’t asked anyone," you already have a problem. If the answer is "my friends, who are super supportive," you have an even bigger problem.&lt;/p&gt;

&lt;p&gt;What you need isn’t support. It’s structured scrutiny — from people (real or simulated) who are incentivized to find flaws, not to validate your illusions. Five perspectives conflicting with one another will yield more truth than one that simply agrees with you.&lt;/p&gt;

&lt;p&gt;The cost of evaluating an idea is an afternoon. The cost of building a bad idea is months of your life you’ll never get back. The math is clear.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related Reading:&lt;/strong&gt; If you're curious about how adversarial thinking applies to debugging opaque systems, check out &lt;a href="https://dev.to/posts/reverse-engineer-neural-network-senior-debugging/"&gt;A 2,500-Layer Neural Network That Turns Out to Be MD5&lt;/a&gt;. And if you want to see how the same council pattern applies to code reviews, read &lt;a href="https://dev.to/en/simplify-jedi-council-ai-code-review/"&gt;Simplify: A Jedi Council for Code Reviews with AI&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>startup</category>
      <category>mvp</category>
    </item>
    <item>
      <title>Git Worktrees: How to Have Multiple AI Agents Working Simultaneously Without Conflicts</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:01:34 +0000</pubDate>
      <link>https://dev.to/frr149/git-worktrees-how-to-have-multiple-ai-agents-working-simultaneously-without-conflicts-21ah</link>
      <guid>https://dev.to/frr149/git-worktrees-how-to-have-multiple-ai-agents-working-simultaneously-without-conflicts-21ah</guid>
      <description>&lt;h2&gt;
  
  
  The Single Checkout Bottleneck
&lt;/h2&gt;

&lt;p&gt;I'm developing a macOS menu bar app. I have three features in the backlog: a consumption sparkline, native notifications, and a desktop widget. All three are independent. I'm building all three with Claude Code.&lt;/p&gt;

&lt;p&gt;The problem: Claude Code works in one directory. One directory has one branch. And &lt;code&gt;git checkout&lt;/code&gt; is like a single-lane roundabout: only one branch gets through at a time.&lt;/p&gt;

&lt;p&gt;If I want to advance all three simultaneously, my classic options are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stash ping-pong&lt;/strong&gt;: &lt;code&gt;git stash&lt;/code&gt;, switch branches, work, &lt;code&gt;git stash pop&lt;/code&gt;, pray there are no conflicts. Repeat until madness or retirement, whichever comes first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clone the repo three times&lt;/strong&gt;: Works, but now I have three &lt;code&gt;.git/&lt;/code&gt; copies, three independent histories, and a &lt;code&gt;git fetch&lt;/code&gt; to do in each one. Wasteful.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accept serial life&lt;/strong&gt;: One feature after another. Safe, predictable, and slow as a hand-written merge sort.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are great. But there's a fourth option that's been in git since 2015 and almost nobody uses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Worktrees: The Solution You Already Had Installed
&lt;/h2&gt;

&lt;p&gt;A worktree is a second working directory that shares the same &lt;code&gt;.git&lt;/code&gt; repository. No copies, no clones, no black magic.&lt;/p&gt;

&lt;p&gt;The analogy: your repo is a library. Until now you had &lt;strong&gt;one desk&lt;/strong&gt; where you could only have one book open. A worktree is adding more desks. Each with a different book open, but all drawing from the same bookshelf.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/code/myapp/                    ← desk 1 (main)
     .git/                       ← the library (just one)

~/code/myapp-sparkline/          ← desk 2 (feature/sparkline)
     .git  ← file, not folder (pointer to library)

~/code/myapp-notifications/      ← desk 3 (feature/notifications)
     .git  ← another pointer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each directory is a complete checkout with all files. You can compile in one, run tests in another, and have your AI agent working in the third. Simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating One is a Single Line
&lt;/h2&gt;

&lt;p&gt;From your main repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git worktree add ../myapp-sparkline &lt;span class="nt"&gt;-b&lt;/span&gt; feature/sparkline
git worktree add ../myapp-notifications &lt;span class="nt"&gt;-b&lt;/span&gt; feature/notifications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done. Two new directories, each on its branch, sharing the entire git database. No cloning, no configuring remotes, no duplicating history.&lt;/p&gt;

&lt;h2&gt;
  
  
  What They Share and What They Don't
&lt;/h2&gt;

&lt;p&gt;This is important. Worktrees share &lt;strong&gt;the entire repo&lt;/strong&gt;: commits, branches, tags, remotes, hooks, configuration. If you commit in the sparkline worktree, you can see it immediately from the notifications one without doing &lt;code&gt;fetch&lt;/code&gt; or anything, because it's the same database.&lt;/p&gt;

&lt;p&gt;What they &lt;strong&gt;don't&lt;/strong&gt; share:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Files on disk (each desk has its working copy)&lt;/li&gt;
&lt;li&gt;The staging area (each has its own &lt;code&gt;git add&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The HEAD (each points to its branch)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simply put: the "what am I working on" state is private to each worktree. Everything else is shared.&lt;/p&gt;
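
&lt;p&gt;A quick way to see the shared database in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cd ~/code/myapp-sparkline
git commit -am "feat(sparkline): first pass"

# No fetch, no push: the other worktree already sees the commit
cd ~/code/myapp-notifications
git log feature/sparkline -1 --oneline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;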

&lt;h2&gt;
  
  
  The Workflow with Coding Agents
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. With worktrees, you can literally have multiple agents working in parallel on the same project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1: Claude Code on sparkline&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-sparkline
claude

&lt;span class="c"&gt;# Terminal 2: Claude Code on notifications&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-notifications
claude

&lt;span class="c"&gt;# Terminal 3: main intact, app running&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp
make run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Claude instance has its own directory, its own branch, its own &lt;code&gt;.build/&lt;/code&gt;. They don't step on each other. They don't compete for the index. They don't need to stash anything.&lt;/p&gt;

&lt;p&gt;And since they share the git database, when one agent finishes and pushes, the others already see that branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Merging: Exactly the Same as Always
&lt;/h2&gt;

&lt;p&gt;Worktrees don't change the merge workflow at all. They're normal branches in separate directories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Option A: local merge&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp
git merge feature/sparkline
git merge feature/notifications

&lt;span class="c"&gt;# Option B: PRs (usual approach)&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-sparkline
git push &lt;span class="nt"&gt;-u&lt;/span&gt; origin feature/sparkline
&lt;span class="c"&gt;# Create PR in GitHub/Gitea, review, merge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you're done, clean up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git worktree remove ../myapp-sparkline
git branch &lt;span class="nt"&gt;-d&lt;/span&gt; feature/sparkline  &lt;span class="c"&gt;# if already merged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Pitfalls Nobody Tells You About
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. One Branch, One Worktree
&lt;/h3&gt;

&lt;p&gt;You can't have &lt;code&gt;main&lt;/code&gt; checked out in two worktrees simultaneously. This is by design: it prevents two directories from modifying the same HEAD and corrupting each other. If you need a second checkout of &lt;code&gt;main&lt;/code&gt;, create a temporary branch.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The First Build is From Scratch
&lt;/h3&gt;

&lt;p&gt;Each worktree has its own build directory. The first compilation will be slow. After that, each worktree maintains its independent cache, which is precisely the advantage over classic &lt;code&gt;git checkout&lt;/code&gt; (which invalidates the cache every time you switch branches).&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Local Untracked Files
&lt;/h3&gt;

&lt;p&gt;Your &lt;code&gt;.env.local&lt;/code&gt;, editor configuration, and anything else git doesn’t track don’t get copied to the new worktree. You’ll need to recreate them or create symlinks.&lt;/p&gt;
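
&lt;p&gt;If sharing them is acceptable, a symlink back to the main checkout does the job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# One secrets file, visible from both worktrees
ln -s ~/code/myapp/.env.local ~/code/myapp-sparkline/.env.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;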

&lt;h3&gt;
  
  
  4. Apps with Shared Disk State
&lt;/h3&gt;

&lt;p&gt;If your app writes data to &lt;code&gt;~/Library/Application Support/&lt;/code&gt; or similar, two app instances from different worktrees will compete for the same file. This isn't a worktree problem, it's a problem of running two instances of the same app. Solution: don't run two simultaneously, or parameterize the data directory per build.&lt;/p&gt;
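
&lt;p&gt;A sketch of the second option, assuming your app reads its data directory from an environment variable (that variable is hypothetical; adapt it to your build):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Each worktree's instance gets its own state directory
MYAPP_DATA_DIR="$HOME/Library/Application Support/myapp-sparkline" make run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;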

&lt;h3&gt;
  
  
  5. Don't Delete the Directory Manually
&lt;/h3&gt;

&lt;p&gt;If you &lt;code&gt;rm -rf&lt;/code&gt; the worktree instead of using &lt;code&gt;git worktree remove&lt;/code&gt;, git still thinks the branch is occupied. Run &lt;code&gt;git worktree prune&lt;/code&gt; to clean up orphaned references.&lt;/p&gt;
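
&lt;p&gt;The recovery looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rm -rf ../myapp-sparkline   # the mistake
git worktree list           # the stale entry still shows up as prunable
git worktree prune          # drop the orphaned reference; the branch is free again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;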

&lt;h3&gt;
  
  
  6. The Remote Knows Nothing
&lt;/h3&gt;

&lt;p&gt;Worktrees are 100% local. Gitea, GitHub, GitLab... no remote knows they exist. They only see normal &lt;code&gt;git push&lt;/code&gt; commands with normal branches. It's like asking if your server has problems with you using Vim or VS Code: it doesn't know, it doesn't care.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Naming convention&lt;/strong&gt;: Put worktrees as siblings of the original repo, with a descriptive suffix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/code/myapp/                    ← main
~/code/myapp-sparkline/          ← feature
~/code/myapp-notifications/      ← feature
~/code/myapp-hotfix-login/       ← hotfix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way &lt;code&gt;ls ~/code/myapp*&lt;/code&gt; shows you everything at a glance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One worktree per feature, not per whim&lt;/strong&gt;: Create worktrees for work that will actually be parallel. If you're going to do things sequentially, a normal branch with &lt;code&gt;checkout&lt;/code&gt; is sufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean up when done&lt;/strong&gt;: Abandoned worktrees are like branches nobody deletes — they accumulate and confuse. &lt;code&gt;git worktree list&lt;/code&gt; is your friend.&lt;/p&gt;
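
&lt;p&gt;Typical output (paths and hashes are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ git worktree list
/Users/you/code/myapp                 a1b2c3d [main]
/Users/you/code/myapp-sparkline       9f8e7d6 [feature/sparkline]
/Users/you/code/myapp-notifications   3c4b5a6 [feature/notifications]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;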

&lt;p&gt;&lt;strong&gt;Don't edit the same file from two worktrees&lt;/strong&gt;: Technically you can, each has its copy. But if both modify the same file, you'll have conflicts when merging. Try to have features touch different areas of the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Complete Workflow Proposal
&lt;/h2&gt;

&lt;p&gt;For those who want an organized workflow, here's what I use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Create worktrees for sprint features&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp
git worktree add ../myapp-feat-a &lt;span class="nt"&gt;-b&lt;/span&gt; feature/feat-a
git worktree add ../myapp-feat-b &lt;span class="nt"&gt;-b&lt;/span&gt; feature/feat-b

&lt;span class="c"&gt;# 2. Launch an agent in each&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-feat-a &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; claude    &lt;span class="c"&gt;# terminal 1&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-feat-b &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; claude    &lt;span class="c"&gt;# terminal 2&lt;/span&gt;

&lt;span class="c"&gt;# 3. Merge as they finish&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-feat-a
git push &lt;span class="nt"&gt;-u&lt;/span&gt; origin feature/feat-a   &lt;span class="c"&gt;# create PR&lt;/span&gt;

&lt;span class="c"&gt;# 4. Clean up what's already merged&lt;/span&gt;
git worktree remove ../myapp-feat-a
git branch &lt;span class="nt"&gt;-d&lt;/span&gt; feature/feat-a

&lt;span class="c"&gt;# 5. See what's still active&lt;/span&gt;
git worktree list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cycle is: &lt;strong&gt;create → work in parallel → push/PR → merge → clean up&lt;/strong&gt;. Each worktree lives as long as the feature, no more, no less.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Worktrees have been in git since version 2.5 (July 2015). More than ten years. And most people still do &lt;code&gt;git stash&lt;/code&gt; like we're in 2010.&lt;/p&gt;

&lt;p&gt;With the arrival of coding agents, the bottleneck is no longer the speed at which you write code — it's the speed at which you can context switch. And worktrees eliminate that context switch completely: you don't switch branches, you switch directories. &lt;code&gt;cd&lt;/code&gt; instead of &lt;code&gt;checkout&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Which is, ultimately, what we should have been doing all along.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;TL;DR&lt;/strong&gt;: &lt;code&gt;git worktree add ../name -b branch&lt;/code&gt; creates a second working directory on the same repo. No copies, no stash, no invalidating caches. Perfect for having multiple coding agents working in parallel. Clean up with &lt;code&gt;git worktree remove&lt;/code&gt; when done.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This article was originally written in Spanish and translated with the help of AI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>git</category>
      <category>ai</category>
      <category>productivity</category>
      <category>tools</category>
    </item>
    <item>
      <title>I'm paying $15 per million tokens to write 'fix: typo'</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:59:32 +0000</pubDate>
      <link>https://dev.to/frr149/im-paying-15-per-million-tokens-to-write-fix-typo-3n64</link>
      <guid>https://dev.to/frr149/im-paying-15-per-million-tokens-to-write-fix-typo-3n64</guid>
      <description>&lt;p&gt;Yesterday I wrote a commit message with Claude Code. The diff was a one-line change: a typo in a comment. Claude Opus read the diff, thought for two seconds, and generated &lt;code&gt;fix: correct typo in auth comment&lt;/code&gt;. That consumed about 800 input tokens and 30 output tokens, at $15 and $75 per million respectively. Cost: about a cent and a half. But multiply that by 40 commits per day, 250 days per year, across a company with 200 developers using coding agents, and those pennies become thousands of dollars spent on the intellectual equivalent of applying band-aids.&lt;/p&gt;
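
&lt;p&gt;The back-of-envelope math, with &lt;code&gt;bc&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Per commit: 800 input tokens at $15/M, 30 output tokens at $75/M
echo '800*15/10^6 + 30*75/10^6' | bc -l    # 0.01425 dollars
# Per year: 40 commits/day x 250 days x 200 developers
echo '0.01425*40*250*200' | bc -l          # 28500 dollars
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;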

&lt;p&gt;The problem isn't that Opus is expensive. The problem is that coding agents don't distinguish between $0.001 tasks and $0.10 tasks. Everything goes through the same model. Generate a commit message, classify an issue, validate a format -- everything hits the big model at the same cost as designing a microservices architecture. It's the equivalent of hiring a surgeon to apply band-aids.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Let's run the numbers with Claude Opus 4 pricing (the previous generation, which most teams still use in production):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;th&gt;Output tokens&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Commit message (small diff)&lt;/td&gt;
&lt;td&gt;~800&lt;/td&gt;
&lt;td&gt;~30&lt;/td&gt;
&lt;td&gt;$0.014&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classify an issue&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;td&gt;~50&lt;/td&gt;
&lt;td&gt;$0.011&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validate commit format&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;td&gt;~20&lt;/td&gt;
&lt;td&gt;$0.006&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standup summary&lt;/td&gt;
&lt;td&gt;~2000&lt;/td&gt;
&lt;td&gt;~200&lt;/td&gt;
&lt;td&gt;$0.045&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these tasks need a model with 2 trillion parameters and multi-step reasoning capability. They're classification and constrained generation tasks. The equivalent of sorting cards by color.&lt;/p&gt;

&lt;p&gt;With Apple Intelligence's on-device model (3B parameters, included in macOS 26): cost $0.00, latency ~300ms, no network, no API key.&lt;/p&gt;

&lt;h2&gt;
  
  
  foundation-hooks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/frr/foundation-hooks" rel="noopener noreferrer"&gt;foundation-hooks&lt;/a&gt; is a set of 4 Swift binaries that use Apple's Foundation Models framework to automate development tasks that don't justify a cloud model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Binary&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Git hook&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fm-commit-msg&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generates conventional commit messages from diff&lt;/td&gt;
&lt;td&gt;&lt;code&gt;prepare-commit-msg&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fm-validate-msg&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Validates format and suggests corrections&lt;/td&gt;
&lt;td&gt;&lt;code&gt;commit-msg&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fm-lql-create&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Classifies and creates Linear issues via &lt;a href="https://github.com/frr/lql" rel="noopener noreferrer"&gt;lql&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fm-lql-standup&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generates standup summary from git log + issues&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All four share the same pattern: define a Swift struct with &lt;code&gt;@Generable&lt;/code&gt;, feed the model minimal context, get structured output in milliseconds.&lt;/p&gt;

&lt;p&gt;Installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/frr/foundation-hooks
&lt;span class="nb"&gt;cd &lt;/span&gt;foundation-hooks
make build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make install-hooks &lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/your/repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From that point on, every &lt;code&gt;git commit&lt;/code&gt; automatically generates a conventional message. The hook has been installed in 11 production repositories for two weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works: @Generable and constrained decoding
&lt;/h2&gt;

&lt;p&gt;This is the part that deserves technical attention. &lt;code&gt;@Generable&lt;/code&gt; isn't "ask the model to return JSON and hope for the best". It's &lt;strong&gt;constrained decoding&lt;/strong&gt; -- the model literally cannot generate tokens that violate the schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  The mechanism
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;@Generable&lt;/code&gt; is a Swift macro that generates a JSON Schema at compile time from the struct.&lt;/li&gt;
&lt;li&gt;The framework injects that schema into the prompt as a response format specification.&lt;/li&gt;
&lt;li&gt;During inference, at each decoding step, &lt;strong&gt;token masking&lt;/strong&gt; is applied: vocabulary tokens that would produce invalid output according to the schema are masked (probability 0 in the softmax).&lt;/li&gt;
&lt;li&gt;The model can only choose from valid tokens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Apple describes this as "guided generation" in the &lt;a href="https://developer.apple.com/videos/play/wwdc2025/301/" rel="noopener noreferrer"&gt;WWDC25 documentation&lt;/a&gt;. It's the same technique OpenAI uses with &lt;code&gt;response_format: json_schema&lt;/code&gt; and Anthropic applies in tool use. The difference: Apple integrates it into Swift's type system. Define the struct, the compiler generates the schema, the runtime applies it during inference. Type safety end-to-end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three levels of constraint
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@Generable&lt;/span&gt;
&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;CommitMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Level 1: HARD constraint — effective enum&lt;/span&gt;
    &lt;span class="c1"&gt;// Active token masking: only "fix", "feat", "refactor", etc.&lt;/span&gt;
    &lt;span class="c1"&gt;// Tokens that would form "bug" or "update" have probability 0.&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anyOf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"fix"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"feat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"refactor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"docs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"chore"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"style"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="c1"&gt;// Level 2: SOFT constraint — like a system prompt for this field&lt;/span&gt;
    &lt;span class="c1"&gt;// The model tends to follow it but isn't forced to.&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Scope of the change, e.g. auth, ui, db. One word, lowercase."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="c1"&gt;// Level 3: no constraint — free string, the model decides&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The analogy: &lt;code&gt;anyOf&lt;/code&gt; is a dropdown, &lt;code&gt;description&lt;/code&gt; is an input with placeholder, and a field without Guide is an empty textarea. The difference between the three isn't one of degree but of mechanism. The first operates at the token level (the model cannot deviate), the second operates at the prompt level (the model tends to follow it), the third has no guidance.&lt;/p&gt;

&lt;p&gt;This is relevant because the git hooks use case is exactly the scenario where hard constraints shine. A commit type must be one of 7 values. No ambiguity, no creativity, no reasoning. It's pure classification. A 3B parameter model with constrained decoding does this as well as a 200B model. The difference is one takes 300ms and is free, the other takes 2 seconds and costs money.&lt;/p&gt;

&lt;h2&gt;
  
  
  Complete code for a hook
&lt;/h2&gt;

&lt;p&gt;This is &lt;code&gt;fm-commit-msg&lt;/code&gt;, the &lt;code&gt;prepare-commit-msg&lt;/code&gt; hook. It's 106 lines of Swift with no external dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;Foundation&lt;/span&gt;
&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;FoundationModels&lt;/span&gt;

&lt;span class="kd"&gt;@Generable&lt;/span&gt;
&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;CommitMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Type of change"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anyOf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"fix"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"feat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"refactor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"docs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"chore"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"style"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Scope of the change, e.g. auth, ui, db, api. One word, lowercase."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Imperative summary of the change, max 50 chars, lowercase, no period"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="kt"&gt;SystemLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isAvailable&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// No Apple Intelligence — exit silently, user writes their own&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things worth highlighting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graceful degradation&lt;/strong&gt;: if Apple Intelligence isn't available (Mac without Apple Silicon, model not downloaded), the hook exits with code 0 and git continues normally. Never blocks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Doesn't fabricate&lt;/strong&gt;: the model receives &lt;code&gt;git diff --cached --stat&lt;/code&gt; and a patch truncated to 3000 characters. Enough to classify and summarize, insufficient to confabulate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Doesn't replace the human&lt;/strong&gt;: the message is written to the commit file with git comments (&lt;code&gt;#&lt;/code&gt;), so &lt;code&gt;git commit&lt;/code&gt; displays it in the editor. The user can modify or discard it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"""
    You generate git commit messages in conventional commits format.
    Focus on WHY the change was made, not WHAT changed.
    The subject must be imperative mood, lowercase, no period, max 50 chars.
    """&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;generating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CommitMessage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;): &lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;session.respond(to:generating:)&lt;/code&gt; returns a &lt;code&gt;CommitMessage&lt;/code&gt; instance, not a &lt;code&gt;String&lt;/code&gt;. No parsing. No regex. No &lt;code&gt;try? JSONDecoder().decode(...)&lt;/code&gt;. The struct is the contract and the compiler guarantees it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Issue tracking integration: fm-lql-create
&lt;/h2&gt;

&lt;p&gt;The same pattern works for issue tracking. &lt;code&gt;fm-lql-create&lt;/code&gt; classifies a natural language description and creates a Linear issue via &lt;a href="https://github.com/frr/lql" rel="noopener noreferrer"&gt;lql&lt;/a&gt;, a Linear CLI written in Rust:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@Generable&lt;/span&gt;
&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;IssueClassification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anyOf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"bug"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"feature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"improvement"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"chore"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anyOf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"urgent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"medium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"none"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Clean, professional issue title. Max 80 chars."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"One-line description for the issue body"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;fm-lql-create &lt;span class="s2"&gt;"auth token refresh crashes when expired"&lt;/span&gt;
PROD | high | bug | TOK: Auth: token refresh crashes on expiry
Token refresh fails silently when the OAuth token has expired, causing auth loop.

Press Enter to create, Ctrl-C to cancel:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The local model classifies the issue in ~500ms: type bug, priority high, clean title, one-line description. Then &lt;code&gt;lql create&lt;/code&gt; creates it in Linear. The &lt;code&gt;--dry-run&lt;/code&gt; flag shows the proposal without executing anything.&lt;/p&gt;

&lt;p&gt;Two fields with &lt;code&gt;anyOf&lt;/code&gt; (type, priority) guarantee the classification is valid. It cannot return "priority: very important" or "type: bugfix". The tokens are masked. Two fields with &lt;code&gt;description&lt;/code&gt; (title, description) give controlled freedom to the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before and after
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;With coding agent (Opus)&lt;/th&gt;
&lt;th&gt;With foundation-hooks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generate commit message&lt;/td&gt;
&lt;td&gt;~2s, ~800 tokens, ~$0.014&lt;/td&gt;
&lt;td&gt;~300ms, 0 tokens, $0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validate format&lt;/td&gt;
&lt;td&gt;~1.5s, ~300 tokens, ~$0.006&lt;/td&gt;
&lt;td&gt;~200ms, 0 tokens, $0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classify issue&lt;/td&gt;
&lt;td&gt;~2s, ~500 tokens, ~$0.011&lt;/td&gt;
&lt;td&gt;~500ms, 0 tokens, $0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generate standup&lt;/td&gt;
&lt;td&gt;~3s, ~2000 tokens, ~$0.045&lt;/td&gt;
&lt;td&gt;~800ms, 0 tokens, $0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requires network&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requires API key&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Works on airplane&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The local model times are actual measurements on a MacBook Pro M4 Pro. Not synthetic benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it can't do
&lt;/h2&gt;

&lt;p&gt;Apple's on-device model is a 3B parameter model with a 4096-token context window. It has clear limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Large diffs&lt;/strong&gt;: above ~3000 characters of patch, the context is truncated. For massive refactors touching 20 files, the model only sees the statistical summary (&lt;code&gt;--stat&lt;/code&gt;), not the complete patch. The commit message will be generic but correct in format.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Architectural decisions&lt;/strong&gt;: "Should I use a protocol or a concrete type here?" is a question that needs project context, codebase history, and multi-step reasoning. That's still big model territory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code generation&lt;/strong&gt;: foundation-hooks doesn't generate code. It generates metadata &lt;em&gt;about&lt;/em&gt; code: commit messages, classifications, summaries. The boundary is clear: if the task is to "write" something a human will review, use the big model. If the task is to "label" something a human already wrote, use the local model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;macOS 26+ with Apple Silicon only&lt;/strong&gt;: doesn't work on Linux, doesn't work on Intel Macs. For heterogeneous teams, the hook exits silently and the user writes their own message.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Prerequisites: macOS 26, Xcode 26, Apple Intelligence enabled&lt;/span&gt;
git clone https://github.com/frr/foundation-hooks
&lt;span class="nb"&gt;cd &lt;/span&gt;foundation-hooks
make build

&lt;span class="c"&gt;# Install hooks in a specific repo&lt;/span&gt;
make install-hooks &lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/your/repo

&lt;span class="c"&gt;# Install CLI binaries to ~/.local/bin&lt;/span&gt;
make install-lql

&lt;span class="c"&gt;# Install hooks in all known repos (edit Makefile to adjust the list)&lt;/span&gt;
make install-all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Makefile&lt;/code&gt; copies the compiled binaries directly to &lt;code&gt;.git/hooks/&lt;/code&gt;. No runtime, no daemon, no configuration. If the binary is in the hook, it works. And there’s always an escape hatch: &lt;code&gt;git commit --no-verify&lt;/code&gt; skips the &lt;code&gt;commit-msg&lt;/code&gt; validation, and the generated message is only a suggestion you can edit or delete in the editor.&lt;/p&gt;
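
&lt;p&gt;For reference, the install step amounts to something like this (the paths are assumptions based on Swift Package Manager defaults, not the actual Makefile):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Roughly what `make install-hooks REPO=...` does
cp .build/release/fm-commit-msg   "$REPO/.git/hooks/prepare-commit-msg"
cp .build/release/fm-validate-msg "$REPO/.git/hooks/commit-msg"
chmod +x "$REPO"/.git/hooks/prepare-commit-msg "$REPO"/.git/hooks/commit-msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;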

&lt;h2&gt;
  
  
  The thesis
&lt;/h2&gt;

&lt;p&gt;Coding agents are extraordinary tools for tasks requiring complex reasoning. But the current pricing model doesn't distinguish between complexity. Every interaction with the model -- from designing an architecture to writing "fix: typo" -- goes through the same pipeline, at the same cost, with the same latency.&lt;/p&gt;

&lt;p&gt;The solution isn't to stop using coding agents. It's to stop using them for everything. Classification, validation, and constrained generation tasks are solvable with a 3B parameter model running locally. The hardware is already in your machine. The framework is already in the operating system. Only the code to connect them was missing.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;foundation-hooks&lt;/code&gt; is 400 lines of Swift connecting those dots. &lt;code&gt;make install-hooks REPO=.&lt;/code&gt; and every commit generates its own message, every issue classifies itself, every standup writes itself in 800ms. No network, no tokens, no cost.&lt;/p&gt;

&lt;p&gt;The surgeon can stop applying band-aids.&lt;/p&gt;

</description>
      <category>swift</category>
      <category>appleintelligence</category>
      <category>git</category>
      <category>llm</category>
    </item>
    <item>
      <title>DIY Codex Automations: Nocturnal Agents with Claude Code and systemd</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:56:29 +0000</pubDate>
      <link>https://dev.to/frr149/diy-codex-automations-claude-code-systemd-kjm</link>
      <guid>https://dev.to/frr149/diy-codex-automations-claude-code-systemd-kjm</guid>
      <description>&lt;p&gt;Two weeks ago, OpenAI introduced &lt;em&gt;Codex Automations&lt;/em&gt;. The idea: define a trigger (a cron job, a push, a new issue), write instructions in natural language, and an agent runs it solo in an isolated &lt;em&gt;worktree&lt;/em&gt;. No human intervention. While you sleep, the agent triages issues, summarizes CI failures, generates &lt;em&gt;release briefs&lt;/em&gt;, and even improves its own instructions.&lt;/p&gt;

&lt;p&gt;Sounds like magic, right? And it is, a little. But there’s one catch they didn’t emphasize too much in the &lt;em&gt;keynote&lt;/em&gt;: you need the Codex App running on your desktop. macOS or Windows only. No &lt;em&gt;headless&lt;/em&gt; servers. No running it on a mini PC and forgetting about it.&lt;/p&gt;

&lt;p&gt;And that’s when I thought: “Wait. I already have this.”&lt;/p&gt;
&lt;h2&gt;
  
  
  The pieces you already have
&lt;/h2&gt;

&lt;p&gt;If you’re using Claude Code, you already have 90% of the infrastructure. &lt;code&gt;claude --print&lt;/code&gt; executes a prompt without an interactive session. You give it instructions; it gives you a result and shuts down. No GUI. No open terminal. Perfect for a &lt;em&gt;cron&lt;/em&gt; job.&lt;/p&gt;

&lt;p&gt;If you have a server that’s always on (a mini PC, Raspberry Pi, or a $5 VPS), you’ve got the scheduler. &lt;code&gt;systemd&lt;/code&gt; or &lt;code&gt;cron&lt;/code&gt;, whichever you prefer, has been working away in the background for decades while you sleep.&lt;/p&gt;

&lt;p&gt;And if you use Gitea, GitHub, or any forge with an API, you already have a place to deposit the results: comments on PRs, new issues, or committed files.&lt;/p&gt;

&lt;p&gt;Plainly put: &lt;em&gt;Codex Automations&lt;/em&gt; is a pattern. Not a product. And that pattern is old news.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│           systemd timer (every N hours)      │
│                     │                        │
│                     ▼                        │
│           bash/fish script                   │
│              │                               │
│              ├── git pull --ff-only           │
│              ├── claude --print "prompt"      │
│              ├── parse results                │
│              ├── notify (Telegram/email)      │
│              └── git push (if changes)        │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Anatomy of an Automation
&lt;/h2&gt;

&lt;p&gt;All automations follow the same structure. A script that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Updates the repo (&lt;code&gt;git pull&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Executes Claude Code in non-interactive mode&lt;/li&gt;
&lt;li&gt;Does something with the results&lt;/li&gt;
&lt;li&gt;Notifies and/or commits changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s build the first one. After that, the rest are just variations on the same theme.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Base Script
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;REPO_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/srv/social-publisher"&lt;/span&gt;
&lt;span class="nv"&gt;LOG_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/log/automations"&lt;/span&gt;
&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d-%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$REPO_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
git pull &lt;span class="nt"&gt;--ff-only&lt;/span&gt;

&lt;span class="nv"&gt;RESULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;claude &lt;span class="nt"&gt;--print&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; sonnet &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-turns&lt;/span&gt; 3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# The prompt comes as an argument&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOG_DIR&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$TIMESTAMP&lt;/span&gt;&lt;span class="s2"&gt;.md"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. The skeleton fits into 12 lines. The rest is about deciding which prompt to pass and what to do with &lt;code&gt;$RESULT&lt;/code&gt;.&lt;/p&gt;
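
&lt;p&gt;A hypothetical invocation, assuming you saved it as &lt;code&gt;/opt/automations/run.sh&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The prompt is just the first argument; the result lands in /var/log/automations
/opt/automations/run.sh "List the TODO and FIXME comments in this repo as a markdown checklist"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;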

&lt;h3&gt;
  
  
  The systemd Timer
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/claude-automation.timer
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Claude Code automation&lt;/span&gt;

&lt;span class="nn"&gt;[Timer]&lt;/span&gt;
&lt;span class="py"&gt;OnCalendar&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;*-*-* 03:00:00&lt;/span&gt;
&lt;span class="py"&gt;Persistent&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;timers.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/claude-automation.service
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Claude Code automation runner&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;oneshot&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;claude-runner&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/opt/automations/review-prs.sh&lt;/span&gt;
&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY=&amp;lt;your-key&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; claude-automation.timer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 3 a.m., &lt;code&gt;systemd&lt;/code&gt; kicks off the script. Claude analyzes whatever you ask it to and deposits the result. You find out in the morning.&lt;/p&gt;
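
&lt;p&gt;Two standard &lt;code&gt;systemd&lt;/code&gt; checks to confirm the timer is armed and see what happened overnight:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl list-timers claude-automation.timer   # shows the next scheduled run
journalctl -u claude-automation.service -n 50   # output from the last run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;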

&lt;h2&gt;
  
  
  Example 1: Automatic PR Review
&lt;/h2&gt;

&lt;p&gt;This is the most useful one. Every time there’s an open PR, Claude reviews it and leaves a comment.&lt;/p&gt;

&lt;p&gt;Using a webhook is more elegant, but a cron job every 30 minutes works just as well for small teams:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;GITEA_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://git.example.com"&lt;/span&gt;
&lt;span class="nv"&gt;GITEA_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;op &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="s1"&gt;'op://DEV/Gitea/token'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"myorg/myrepo"&lt;/span&gt;

&lt;span class="c"&gt;# Get open PRs&lt;/span&gt;
&lt;span class="nv"&gt;PRS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: token &lt;/span&gt;&lt;span class="nv"&gt;$GITEA_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITEA_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/v1/repos/&lt;/span&gt;&lt;span class="nv"&gt;$REPO&lt;/span&gt;&lt;span class="s2"&gt;/pulls?state=open"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.[].number'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;PR_NUM &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$PRS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="c"&gt;# Get the diff&lt;/span&gt;
  &lt;span class="nv"&gt;DIFF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: token &lt;/span&gt;&lt;span class="nv"&gt;$GITEA_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITEA_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/v1/repos/&lt;/span&gt;&lt;span class="nv"&gt;$REPO&lt;/span&gt;&lt;span class="s2"&gt;/pulls/&lt;/span&gt;&lt;span class="nv"&gt;$PR_NUM&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/diff"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

  &lt;span class="c"&gt;# Claude reviews the diff&lt;/span&gt;
  &lt;span class="nv"&gt;REVIEW&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;claude &lt;span class="nt"&gt;--print&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model&lt;/span&gt; sonnet &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--max-turns&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"Review this PR diff. Flag potential bugs, &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
     security issues, and specific areas for improvement. Be concise. &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
     Do not repeat the code; highlight issues with their line.

     &lt;/span&gt;&lt;span class="nv"&gt;$DIFF&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

  &lt;span class="c"&gt;# Post as a comment&lt;/span&gt;
  curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: token &lt;/span&gt;&lt;span class="nv"&gt;$GITEA_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITEA_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/v1/repos/&lt;/span&gt;&lt;span class="nv"&gt;$REPO&lt;/span&gt;&lt;span class="s2"&gt;/pulls/&lt;/span&gt;&lt;span class="nv"&gt;$PR_NUM&lt;/span&gt;&lt;span class="s2"&gt;/comments"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;body&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;## Automated Review&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;n&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;n&lt;/span&gt;&lt;span class="nv"&gt;$REVIEW&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each morning when you open Gitea, every PR has a comment with feedback. It doesn’t replace a human review, but it filters out the obvious: typos, unused imports, an &lt;code&gt;if&lt;/code&gt; without an &lt;code&gt;else&lt;/code&gt; that smells like a bug.&lt;/p&gt;
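
&lt;p&gt;Before scheduling it, you can smoke-test the prompt on a local branch by piping a diff into &lt;code&gt;claude --print&lt;/code&gt; (the branch name here is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git diff main...my-feature | claude --print --model sonnet --max-turns 1 \
  "Review this PR diff. Flag potential bugs, security issues, and specific areas for improvement. Be concise."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;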




&lt;p&gt;[rest of the examples and entire blog follow translated...]&lt;/p&gt;

</description>
      <category>automation</category>
      <category>claude</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>In Codex, a Skill Is Not a /Command (but in Claude Code, It Almost Is)</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:54:27 +0000</pubDate>
      <link>https://dev.to/frr149/in-codex-a-skill-is-not-a-command-but-in-claude-code-it-almost-is-1pi4</link>
      <guid>https://dev.to/frr149/in-codex-a-skill-is-not-a-command-but-in-claude-code-it-almost-is-1pi4</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; If you're using Codex, use a &lt;strong&gt;command&lt;/strong&gt; to control the session or application, and use a &lt;strong&gt;skill&lt;/strong&gt; to teach the agent a way of working. In Claude Code, the current documentation already treats &lt;em&gt;skills&lt;/em&gt; as something you can invoke with &lt;code&gt;/skill-name&lt;/code&gt;, so the concepts merge more there. Not so in Codex: &lt;code&gt;types&lt;/code&gt; might exist as a skill, but &lt;code&gt;/types&lt;/code&gt; won't exist by default.&lt;/p&gt;




&lt;p&gt;There's a common confusion when switching from Claude Code to Codex. And it's understandable.&lt;/p&gt;

&lt;p&gt;You create a &lt;em&gt;skill&lt;/em&gt; called &lt;code&gt;types&lt;/code&gt;, go back to the terminal, type &lt;code&gt;/types&lt;/code&gt; all confident... and Codex looks at you like you just walked into a hardware store and ordered a latte.&lt;/p&gt;

&lt;p&gt;The problem isn't that your skill is broken. The problem is that in Codex, &lt;strong&gt;a skill and a command are not the same thing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And here's the kicker: this distinction is not just cosmetic. It changes how you design your workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Analogy to Make It Clear
&lt;/h2&gt;

&lt;p&gt;Think of Codex as a plane with two levels.&lt;/p&gt;

&lt;p&gt;The first level is the &lt;strong&gt;cockpit&lt;/strong&gt;: buttons, levers, indicators. That's where commands live. They control the session, the client, or the tool. It's operational control.&lt;/p&gt;

&lt;p&gt;The second level is the &lt;strong&gt;copilot's manual&lt;/strong&gt;: procedures, guidelines, checklists, avoidable pitfalls. That's where skills live. They change &lt;strong&gt;how the agent thinks&lt;/strong&gt; when performing a task.&lt;/p&gt;

&lt;p&gt;Put simply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;command&lt;/strong&gt; affects the cockpit.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;skill&lt;/strong&gt; affects the copilot's head.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you try to use the manual as if it were a button, that doesn’t fly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a Command in Codex?
&lt;/h2&gt;

&lt;p&gt;In Codex, commands come in two flavors that shouldn’t be mixed up.&lt;/p&gt;

&lt;p&gt;The first type is &lt;strong&gt;CLI commands&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codex login
codex &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="s2"&gt;"run tests and fix failures"&lt;/span&gt;
codex resume &lt;span class="nt"&gt;--last&lt;/span&gt;
codex apply
&lt;span class="nt"&gt;---&lt;/span&gt;

No mystery here. These are application operations. Authenticating, running a task, resuming a session, applying a diff. If you removed the model tomorrow, these commands would still make sense.

The second &lt;span class="nb"&gt;type &lt;/span&gt;is &lt;span class="k"&gt;**&lt;/span&gt;slash commands &lt;span class="k"&gt;in &lt;/span&gt;an interactive session&lt;span class="k"&gt;**&lt;/span&gt;:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/model
/permissions
/personality
/agent
/status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
These aren’t "fancy prompts" either. They’re live session controls. They change the model, permissions, personality, active thread, or visible state. They’re cockpit buttons.

OpenAI, in fact, documents it this clearly: there's one dedicated page for **slash commands** "to control Codex during interactive sessions," and another distinct page for **skills**, defining them as the authoring format for *reusable workflows*.

That's why these are commands and not skills: they require predictable, immediate behavior with stable semantics. You don't want the model to "creatively interpret" what `/permissions` means. You want it to change permissions. Period.

## What Is a Skill in Codex?

A skill in Codex is something else entirely. It’s a reusable workflow that teaches the agent **when** to apply an approach, **how** to think about a task, and **which steps** to follow.

And here’s another fine but important nuance: OpenAI says a skill is the authoring format, whereas the **plugin** is the installable or distributable unit. In other words, you first design the workflow as a skill; if you want to share or package it later, you wrap it up.

Clear examples:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$types
$improve
$owasp
$blog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Or, if you prefer natural language:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use types to audit this repo
use improve to review this diff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Here, you're not telling Codex, "Change a setting." You're saying, "When you do this task, follow this playbook."

For example, my `types` skill shouldn’t be a button. It needs to read the project, detect the language, inspect models, look for stringly-typed code, decide if an `Optional` is being used correctly or if it's modeling a domain state. That requires context and judgment. That’s exactly the type of work a skill is designed to handle.

For the same reason, `improve` makes sense as a skill: reviewing a diff isn’t a deterministic action. It’s a specific way to approach code review.

## Why It Feels Like “The Same Thing” in Claude Code

Here's the mental trap.

The current Claude Code documentation isn’t shy about this. It talks about **skills** and tells you that you can invoke them directly with:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/skill-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
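
&lt;p&gt;For context, defining one of those skills is just a folder plus a &lt;code&gt;SKILL.md&lt;/code&gt;. A minimal sketch, assuming the documented Agent Skills layout (check your version's docs for the exact fields; the skill body here is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical 'types' skill; the frontmatter description is what
# lets the agent decide on its own when to pull the skill in
mkdir -p ~/.claude/skills/types
cat &gt; ~/.claude/skills/types/SKILL.md &lt;&lt;'EOF'
---
name: types
description: Audit a codebase's type design; use when reviewing models or APIs
---

Read the project, detect the language, and flag stringly-typed code.
Check whether each Optional models a real domain state.
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;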

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
In Claude Code, a significant part of what you perceive as a "reusable workflow" enters through a slash command syntax. The UX blends two concepts that are separate in Codex:

- Reusing a workflow
- Invoking it with `/something`

Additionally, Claude Code retains its **built-in commands** separately:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/help
/compact
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
And it even separates yet another piece: **subagents**, which are specialized assistants with their own context, permissions, and system prompt.

In other words:

- In **Claude Code**, skills, subagents, and commands coexist, but skills can be invoked with `/`.
- In **Codex**, reusable workflows live as skills, and `/commands` are reserved for explicit session control.

That’s why, coming from Claude, your brain quickly learns a practical equivalence: "If something reusable exists, I’ll probably trigger it with `/something`." In Codex, this mental shortcut stops working.

## Concrete Examples: What Should Be a Skill vs. a Command?

### Things That Should Be a Skill in Codex

**`types`**

Because you’re not “triggering an action.” You want to apply type design principles on a real codebase.

**`improve`**

Because reviewing a diff isn’t a mechanical operation. It involves judgment, context, and priorities.

**`blog`**

Because writing an article with tone, structure, and fact-checking is a reasoning flow, not a button.

**`owasp`**

Because a security audit needs to adapt heuristics to the stack, repo, and specific risks.

### Things That Should Be a Command in Codex

**`codex login`**

There’s nothing to reason about. You either authenticate or you don’t.

**`/model`**

Switching models is a client operation. Not a work criterion.

**`/permissions`**

Tweaking permissions mid-session is pure operational control.

**`codex resume --last`**

Reopening a session isn’t cognitive workflow. It’s an app action.

## The Trickiest Case: Hybrid Tasks

There’s an intermediate category that can trip you up at first: workflows you’d like to launch with a convenient syntax, but whose logic is still skill-based.

For example:

- You’d like to write `/types`
- But conceptually, `types` is still a skill

The elegant solution here isn’t "turn the skill into something else." The solution is to wrap it.

That means:

1. Keep the intelligence in the skill.
2. Create a plugin or command to invoke it with slash-command ergonomics.

This way, you get the best of both worlds: command UX, skill brains.
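
&lt;p&gt;A minimal sketch of that wrapping, assuming Codex picks up custom prompts as slash commands from &lt;code&gt;~/.codex/prompts&lt;/code&gt; (the file name and wording here are hypothetical; check your version's docs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The prompt file is the button; the skill keeps the brains
mkdir -p ~/.codex/prompts
cat &gt; ~/.codex/prompts/types.md &lt;&lt;'EOF'
Use the $types skill to audit the type design of the current repository.
EOF
# From now on, typing /types in an interactive session expands to the line above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;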

## The Golden Rule in Codex

When deciding between a command and a skill, use this test:

**Do you want to change the session or app state?**

Then you need a **command**.

**Do you want to change how the agent approaches a task?**

Then you need a **skill**.

Here’s a handy table:

| I want to...                     | Use in Codex... | Example             |
|----------------------------------|-----------------|---------------------|
| change permissions               | `command`       | `/permissions`      |
| switch models                    | `command`       | `/model`            |
| resume a session                 | `command`       | `codex resume --last` |
| apply an auditing criterion      | `skill`         | `$types`            |
| review a diff with a methodology | `skill`         | `$improve`          |
| draft with an editorial guide    | `skill`         | `$blog`             |

## So, Which Should You Use?

The short answer: **in Codex, use skills for reusable knowledge and commands for operational control**.

If you’re coming from Claude Code, your first instinct will be to turn every reusable workflow into `/something`. That’s an understandable habit because Claude's documentation encourages that thinking. But in Codex, that habit will get you stuck fast.

First, design the **skill**. If you later need more ergonomic input, wrap it in a plugin or command. Not the other way around.

Because if you start with the button before you’re clear on the procedure, you’ll end up with a pretty interface that doesn’t do much. And we’ve already got too many of those in this industry.

Here’s the takeaway: in Claude Code, a *skill* can come through the `/slash-command` door. In Codex, it can’t. And honestly, that’s probably a good thing.

Once you understand this difference, you’ll stop fighting with `/types` and start building workflows that actually fit the tool. Progress!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>codex</category>
      <category>claudecode</category>
      <category>skills</category>
      <category>cli</category>
    </item>
  </channel>
</rss>
