<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fernando Rodriguez</title>
    <description>The latest articles on DEV Community by Fernando Rodriguez (@frr149).</description>
    <link>https://dev.to/frr149</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906301%2F4e64d5c4-6405-4465-9411-41b6e57e3818.jpg</url>
      <title>DEV Community: Fernando Rodriguez</title>
      <link>https://dev.to/frr149</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/frr149"/>
    <language>en</language>
    <item>
      <title>From /simplify to the Jedi Council: How I Built a Code Review with Kent Beck, Martin Fowler, and Mike Acton</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:21:52 +0000</pubDate>
      <link>https://dev.to/frr149/from-simplify-to-the-jedi-council-how-i-built-a-code-review-with-kent-beck-martin-fowler-and-16d9</link>
      <guid>https://dev.to/frr149/from-simplify-to-the-jedi-council-how-i-built-a-code-review-with-kent-beck-martin-fowler-and-16d9</guid>
      <description>&lt;p&gt;Claude Code includes a &lt;em&gt;slash command&lt;/em&gt; called &lt;code&gt;/simplify&lt;/code&gt; that automatically reviews your code. I ran it on a hefty diff — about 500 lines across 8 files — and the results were... interesting. It found things I wouldn’t have noticed, but it also wasted my time pointing out stuff that didn’t matter.&lt;/p&gt;

&lt;p&gt;So, I took it apart and rebuilt it piece by piece.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does /simplify Do?
&lt;/h2&gt;

&lt;p&gt;It’s a &lt;em&gt;skill&lt;/em&gt; that comes bundled with Claude Code (you don’t install it). It launches three agents in parallel, each looking at the same diff from a different angle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Code Reuse&lt;/strong&gt; — Are there existing utilities that could replace newly added code?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Quality&lt;/strong&gt; — Redundant state, &lt;em&gt;copy-paste&lt;/em&gt;, &lt;em&gt;leaky abstractions&lt;/em&gt;, &lt;em&gt;stringly-typed code&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency&lt;/strong&gt; — Unnecessary I/O, missed concurrency opportunities, &lt;em&gt;memory leaks&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The three produce findings, and then the system tries to fix the issues directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does Well
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;reuse&lt;/em&gt; agent caught a helper that was duplicated verbatim in two test suites. Same name, same lines, two different files. I moved it to a shared module. Nice and clean.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;efficiency&lt;/em&gt; agent spotted a double trip to disk inside a processing loop: load state, modify, save, read data, re-load, re-save. Two writes when one would suffice. I wouldn’t have noticed that myself.&lt;/p&gt;

&lt;p&gt;It also flagged a memory buffer that wasn’t cleaned up in the error path. If something failed between allocation and release, &lt;em&gt;leak&lt;/em&gt;. The main path was fine. Classic &lt;em&gt;copy-paste&lt;/em&gt; swallowing the detail.&lt;/p&gt;

&lt;p&gt;So far, so good. Three legitimate, actionable findings. But the problem with &lt;code&gt;/simplify&lt;/code&gt; isn’t what it catches — it’s everything else it reports.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Falls Short
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Too much noise in low-severity issues.&lt;/strong&gt; It suggested removing a field from a struct because it was “redundant” with a computed property. We’re talking 8 bytes. That field is used in more than 10 places in the code and the tests. The churn of removing it far outweighs the benefit of saving a single integer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No understanding of project context.&lt;/strong&gt; It flagged a concurrency pattern as HIGH risk, which is fair — that’s correct in the abstract. But it had already been documented in the project’s &lt;code&gt;CLAUDE.md&lt;/code&gt;, had a dedicated linter, was &lt;em&gt;allowlisted&lt;/em&gt;, and had an open issue. The agent didn’t know any of this because it works only with the diff, in complete isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Doesn’t distinguish "incorrect" from "improvable."&lt;/strong&gt; The double disk trip was inefficient but correct. The concurrency pattern was a latent bomb. Both came back as MEDIUM priority. The prioritization is flat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suggests enums for external data.&lt;/strong&gt; It claimed that some fields in a DTO should be enums instead of strings. But those fields come from an external API. They’re only read and displayed. Turning them into enums requires &lt;em&gt;custom decoding&lt;/em&gt; and adds nothing — if the API sends a new value, your enum blows up instead of gracefully degrading.&lt;/p&gt;

&lt;p&gt;These are mistakes a developer with project context would filter out in two seconds. But &lt;code&gt;/simplify&lt;/code&gt; has no context. It has a diff and good intentions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Fixes I Made
&lt;/h2&gt;

&lt;p&gt;After reviewing the outputs, I identified three structural problems with &lt;code&gt;/simplify&lt;/code&gt; and fixed them in a custom &lt;em&gt;skill&lt;/em&gt; I called &lt;code&gt;/improve&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Inject Project Context
&lt;/h3&gt;

&lt;p&gt;Each agent receives the &lt;code&gt;CLAUDE.md&lt;/code&gt;, open issues from the tracker, and linter results before generating findings. If something is already managed, it mentions it but doesn’t report it as new.&lt;/p&gt;

&lt;p&gt;This eliminates the most irritating category of &lt;em&gt;false positives&lt;/em&gt;: the ones you already know about and have under control.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cost/Benefit Filtering
&lt;/h3&gt;

&lt;p&gt;Before reporting, each agent estimates how many files the fix would touch. If the effort-to-improvement ratio is unfavorable — like renaming a field in 10+ spots for minor readability gains — it filters it out.&lt;/p&gt;

&lt;p&gt;This seems obvious, but &lt;code&gt;/simplify&lt;/code&gt; doesn’t do it. It treats a one-line change and a 15-file refactor with the same priority.&lt;/p&gt;
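
&lt;p&gt;To make it concrete, here is a minimal sketch of that kind of gate. The names and the threshold are mine, not &lt;code&gt;/simplify&lt;/code&gt;’s internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical cost/benefit gate; every name and threshold is illustrative.
def worth_reporting(finding):
    files_touched = len(finding["files_to_change"])
    lines_cleaned = finding["lines_removed_or_simplified"]
    # A fix that touches many files to clean up a handful of lines
    # is churn, not improvement. Drop it.
    return lines_cleaned &gt; files_touched * 10

findings = [
    {"title": "remove 8-byte field", "files_to_change": ["a"] * 12,
     "lines_removed_or_simplified": 3},
    {"title": "dedupe test helper", "files_to_change": ["x", "y"],
     "lines_removed_or_simplified": 40},
]
reported = [f for f in findings if worth_reporting(f)]  # only the dedupe survives
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;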

&lt;h3&gt;
  
  
  3. Separate "Auto-Fix" from "Backlog Issue"
&lt;/h3&gt;

&lt;p&gt;Findings are split into two types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;auto-fix&lt;/code&gt;&lt;/strong&gt;: Mechanical, ≤3 files, low risk. Applied directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;issue&lt;/code&gt;&lt;/strong&gt;: Requires design, touches &amp;gt;3 files, or changes an interface. Created as a tracker issue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents the review from attempting fixes that need more thought.&lt;/p&gt;
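
&lt;p&gt;A sketch of the triage, using the thresholds from the list above (field names are mine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative triage; the thresholds match the rules above.
def triage(finding):
    if (finding["mechanical"]
            and finding["files_touched"] &lt;= 3
            and not finding["changes_interface"]):
        return "auto-fix"   # applied directly
    return "issue"          # sent to the tracker for a human

print(triage({"mechanical": True, "files_touched": 2, "changes_interface": False}))  # auto-fix
print(triage({"mechanical": True, "files_touched": 7, "changes_interface": False}))  # issue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;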

&lt;h2&gt;
  
  
  What I Didn’t Do (And Why)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A second LLM as a reviewer.&lt;/strong&gt; Sexy idea — &lt;em&gt;cross-model validation&lt;/em&gt;, more eyes, additional &lt;em&gt;training&lt;/em&gt;. In practice, the bottleneck isn’t the number of eyes but the quality of the context. A second model without access to the &lt;code&gt;CLAUDE.md&lt;/code&gt; or tracker spits out the same thing: generic “best practices” advice you can find in any book.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Categorization into 4 severity levels.&lt;/strong&gt; I started with CRITICAL/HIGH/MEDIUM/LOW, but with cost/benefit filtering active, almost everything that passes the filter is MEDIUM or HIGH. The other two categories are empty. More taxonomy doesn’t mean better prioritization.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Jedi Council
&lt;/h2&gt;

&lt;p&gt;And here’s the idea that changed the game.&lt;/p&gt;

&lt;p&gt;A few weeks ago, I wrote about &lt;a href="https://dev.to/posts/summoning-the-wise-mentoring-experts-llm/"&gt;invoking experts as mentors&lt;/a&gt; — asking an LLM to adopt the perspective of Tufte, Munger, or whoever fits your needs. It worked brilliantly in design.&lt;/p&gt;

&lt;p&gt;What if, instead of three generic agents (&lt;em&gt;reuse&lt;/em&gt;, &lt;em&gt;quality&lt;/em&gt;, &lt;em&gt;efficiency&lt;/em&gt;), I used three agents with &lt;strong&gt;names, philosophies, and specific decision rules&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;The idea has a name every Star Wars fan will recognize: the &lt;strong&gt;Jedi Council&lt;/strong&gt;. Three masters with different perspectives evaluating the same case. But be careful — this isn’t about the LLM doing &lt;em&gt;surface-level impersonations&lt;/em&gt; by quoting famous lines. It’s about each “wise master” applying &lt;strong&gt;specific filtering rules&lt;/strong&gt; that a generic reviewer wouldn’t.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Three Masters (and Why These)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kent Beck — Simplicity.&lt;/strong&gt; &lt;em&gt;"Make it work, make it right, make it fast — in that order."&lt;/em&gt; He’s the guy who tells you “those two identical blocks of code are fine, don’t extract a helper just yet.” His key rule: &lt;strong&gt;The Rule of Three&lt;/strong&gt;. DO NOT report duplication unless the same block appears three times. Twice is coincidence. Three times is a pattern. And if the fix touches more files than the code it’d clean up, it’s probably not worth it.&lt;/p&gt;

&lt;p&gt;But Beck isn’t just about simplicity. He also catches &lt;strong&gt;correctness bugs&lt;/strong&gt;: cases where the obvious choice has semantics different from the correct one. That &lt;code&gt;async&lt;/code&gt; keyword that seems harmless but inherits a context you don’t want. The &lt;em&gt;default&lt;/em&gt; that works in tests but blows up in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Martin Fowler — Design.&lt;/strong&gt; &lt;em&gt;Code smells&lt;/em&gt; are symptoms, not diseases. Refactoring is a discipline, not a hobby. His key rule: &lt;strong&gt;Only suggest refactoring if there’s a concrete change it would benefit.&lt;/strong&gt; &lt;em&gt;"Refactoring without direction is codebase tourism."&lt;/em&gt; If a string comes from an external API and is only read, don’t suggest converting it to an enum. If one field is always synchronized with another by design, the redundancy is intentional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mike Acton — Performance.&lt;/strong&gt; &lt;em&gt;"The purpose of all programs is to transform data from one form to another."&lt;/em&gt; If you haven’t measured, you don’t have a performance problem — you have an opinion. His key rule: &lt;strong&gt;I/O is what matters in 99% of apps.&lt;/strong&gt; CPU rarely bottlenecks. Disk and network do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Acton Doesn’t Guess — He Measures
&lt;/h3&gt;

&lt;p&gt;Here’s where it gets interesting. Mike Acton doesn’t stop at static analysis. He does two things before rendering a verdict:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static I/O Counting&lt;/strong&gt;: Scans the diff for read/write operations to disk, network, or databases. Maps each operation to its context: Is it in a loop? A &lt;em&gt;hot path&lt;/em&gt;? Generates a frequency table before offering an opinion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Profiling&lt;/strong&gt;: If the diff touches &lt;em&gt;hot path&lt;/em&gt; code and the project can compile, runs a &lt;em&gt;profiler&lt;/em&gt; and condenses the results. If a &lt;em&gt;hotspot&lt;/em&gt; aligns with code from the diff, it reports it with numbers, not opinions.&lt;/p&gt;

&lt;p&gt;The I/O table includes rough time estimates: SSD read ~0.5ms, write ~1ms, flush ~2-5ms, network ~100-500ms. It’s not precise — it’s for spotting operations that, in aggregate, cross a threshold.&lt;/p&gt;
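
&lt;p&gt;For illustration, the shape of that table in code. The per-operation costs are the rough estimates above; the findings and the loop size are invented:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough I/O cost model in ms; estimates from the text, structure is mine.
IO_COST_MS = {"ssd_read": 0.5, "ssd_write": 1.0, "flush": 3.0, "network": 300.0}

# Hypothetical findings from scanning a diff: (operation, count, in_loop)
ops = [("ssd_write", 2, True), ("network", 1, False)]

LOOP_ITERATIONS = 100  # assumed hot-loop size, for aggregation
for op, count, in_loop in ops:
    total_ms = IO_COST_MS[op] * count * (LOOP_ITERATIONS if in_loop else 1)
    print(f"{op}: ~{total_ms:.0f} ms aggregate")  # flag anything over a threshold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;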

&lt;h2&gt;
  
  
  The Risk (And How to Avoid It)
&lt;/h2&gt;

&lt;p&gt;Before you start building a Jedi Council for every pull request, here’s the elephant in the room: &lt;strong&gt;The LLM can do surface-level impersonation.&lt;/strong&gt; It might output “as Kent Beck would say…” and just spout the same generic advice under his name.&lt;/p&gt;

&lt;p&gt;To avoid this, the instructions don’t say “adopt Kent Beck’s perspective.” They say: &lt;em&gt;"Apply the Rule of Three: if a fix touches more files than it cleans up, discard it."&lt;/em&gt; Specific rules, not vibes.&lt;/p&gt;

&lt;p&gt;Also, each master &lt;strong&gt;must finish with a “Discarded” section&lt;/strong&gt; — findings they considered but rejected, with the rule applied. This makes it clear that the master actively filtered, not just reported less.&lt;/p&gt;

&lt;p&gt;And if two masters disagree on the same code — Beck says “don’t touch” while Fowler says “refactor” — a moderator agent evaluates the specific case until consensus is reached. If no consensus → discard. Better to do nothing than do the wrong thing.&lt;/p&gt;

&lt;p&gt;Is it the same as having Kent Beck in the room? Obviously not. But it’s infinitely better than three generic agents reporting everything without judgment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Test: Same Diff, Two Reviews
&lt;/h2&gt;

&lt;p&gt;I ran the same diff through &lt;code&gt;/simplify&lt;/code&gt; and &lt;code&gt;/improve&lt;/code&gt;. Same changes, same project, same session:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;/simplify&lt;/th&gt;
&lt;th&gt;/improve&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reported Findings&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;False Positives&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;3-4&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New Findings&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1 (concurrency bug, HIGH)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Discarded" Section&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes, with applied rule&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project Context&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;CLAUDE.md + tracker + linters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The new finding &lt;code&gt;/improve&lt;/code&gt; caught that &lt;code&gt;/simplify&lt;/code&gt; didn’t: a concurrency bug where an apparently correct pattern inherited a faulty execution context, causing the UI to freeze. In plain language: the code looked fine, compiled cleanly, but it blocked the main thread. &lt;code&gt;/simplify&lt;/code&gt; missed it because its generic agents don’t look for bugs where the “obvious” choice is wrong. Kent Beck did, because that’s exactly his mandate.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;false positives&lt;/em&gt; from &lt;code&gt;/simplify&lt;/code&gt; — the “redundant” 8-byte field, the already managed concurrency pattern, the enums for external JSON — didn’t show up in &lt;code&gt;/improve&lt;/code&gt;. Cost/benefit filtering caught the first. Project context filtered out the second. Fowler’s rule (“&lt;em&gt;stringly-typed&lt;/em&gt; is fine unless it hurts”) discarded the third.&lt;/p&gt;

&lt;p&gt;What sold me the most: the “Discarded” section. Seeing what each master considered and why they rejected it inspires far more trust than just seeing what they reported. &lt;strong&gt;You know they looked at more than they said.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Install It
&lt;/h2&gt;

&lt;p&gt;The skill is called &lt;code&gt;/improve&lt;/code&gt; and lives in &lt;code&gt;~/.claude/skills/improve/SKILL.md&lt;/code&gt;. It’s a global skill for Claude Code — works in any project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create the directory&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/.claude/skills/improve

&lt;span class="c"&gt;# Copy the SKILL.md (or write your own following Claude Code’s skill structure)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Review only (no code changes)&lt;/span&gt;
/improve

&lt;span class="c"&gt;# Review + apply mechanical fixes&lt;/span&gt;
/improve &lt;span class="nt"&gt;--fix&lt;/span&gt;

&lt;span class="c"&gt;# Review + draft report for senior dev&lt;/span&gt;
/improve &lt;span class="nt"&gt;--report&lt;/span&gt;

&lt;span class="c"&gt;# Review a specific commit range&lt;/span&gt;
/improve &lt;span class="nt"&gt;--diff&lt;/span&gt; HEAD~5..HEAD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/simplify&lt;/code&gt; is a solid starting point. Three generic agents detect duplication, inefficiencies, and &lt;em&gt;code smells&lt;/em&gt;. But without project context, it creates noise; without cost/benefit filtering, it suggests changes that aren’t worth it; and without specific criteria, it treats all findings equally.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/improve&lt;/code&gt; is the next step: three masters with targeted philosophies, project context, cost/benefit analysis, and separation between auto-fixes and backlog issues. Beck tells you when NOT to extract a helper. Fowler tells you when a &lt;em&gt;smell&lt;/em&gt; is purely cosmetic. Acton tells you when a “performance problem” is just an unmeasured opinion.&lt;/p&gt;

&lt;p&gt;Fewer findings, zero &lt;em&gt;false positives&lt;/em&gt;, and one real bug the other missed. Sometimes improvement isn’t more eyes — it’s better eyes.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://dev.to/posts/summoning-the-wise-mentoring-experts-llm/"&gt;Invoking the Experts&lt;/a&gt; — the original technique with Tufte and Munger. &lt;a href="https://dev.to/posts/claude-code-loop-vs-cron-scheduling/"&gt;/loop vs claude-cron&lt;/a&gt; — another Claude Code skill I analyzed.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>codereview</category>
      <category>productivity</category>
      <category>ai</category>
    </item>
    <item>
      <title>RustyClaw: I'm rewriting an AI agent in Rust (because the meme demands it)</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:19:50 +0000</pubDate>
      <link>https://dev.to/frr149/rustyclaw-im-rewriting-an-ai-agent-in-rust-because-the-meme-demands-it-280i</link>
      <guid>https://dev.to/frr149/rustyclaw-im-rewriting-an-ai-agent-in-rust-because-the-meme-demands-it-280i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You know what’s great about Rust? It doesn’t let you compile crappy code. You know what sucks? Everything you write at the beginning **is&lt;/em&gt;* crappy code."*&lt;br&gt;
— Mr. Krabs, probably&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What’s better than an AI agent? An AI agent &lt;em&gt;rewritten in Rust&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you’ve spent more than five minutes on the internet, you’re aware of the meme. It doesn’t matter what project—text editor, DNS server, BMI calculator. Someone will inevitably comment, "you should rewrite it in Rust." It’s the &lt;em&gt;Rewrite It In Rust&lt;/em&gt; meme—RIIR for friends—and it’s as unavoidable as gravity.&lt;/p&gt;

&lt;p&gt;Well, I’m actually doing it. I’m going to port 8,300 lines of a Python AI agent to Rust. But not just because the meme demands it (okay, maybe a little). I’m doing it because I need a guinea pig.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thesis
&lt;/h2&gt;

&lt;p&gt;For weeks now, I’ve been writing about &lt;a href="https://dev.to/posts/silent-failure-ai-makes-stuff-up-tests-everything-fine/"&gt;&lt;em&gt;silent failures&lt;/em&gt;&lt;/a&gt;, about the &lt;a href="https://dev.to/posts/five-defenses-code-hallucinations/"&gt;five defenses against hallucinations&lt;/a&gt;, about how an LLM can generate code that compiles, passes tests, and is still wrong. I even gave it a name: &lt;strong&gt;adversarial development&lt;/strong&gt;. &lt;em&gt;Never trust, always verify.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A lot of theory. Now it’s time to prove it.&lt;/p&gt;

&lt;p&gt;I needed a project with three key traits: constrained scope (not a new app with ever-changing requirements), a clear source of truth (the Python code that already works), and enough complexity for the LLM’s hallucinations to have room to hide. A pure port checks all three boxes: the input and expected output already exist. If the Rust version doesn’t behave exactly like the Python one, there’s a bug. Simple as that.&lt;/p&gt;

&lt;p&gt;And since I’m going to port something, why not use it as an opportunity to properly learn Rust? The &lt;em&gt;borrow checker&lt;/em&gt;, &lt;em&gt;ownership&lt;/em&gt;, &lt;em&gt;lifetimes&lt;/em&gt;... I’ve spent years reading all about it and touching none of it. Things would be different if I stopped reading tutorials for the 20th time and actually tackled a real project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The patient
&lt;/h2&gt;

&lt;p&gt;It’s called &lt;a href="https://github.com/HKUDS/nanobot" rel="noopener noreferrer"&gt;nanobot&lt;/a&gt;. It’s a personal AI agent derived from OpenClaw: a nifty tool that links LLMs (Claude, GPT, DeepSeek, you name it) to chat channels—Telegram, Discord, Slack, email—and gives them hands. It can read/edit files, run commands, browse the web, schedule cron tasks, and maintain persistent memories between conversations.&lt;/p&gt;

&lt;p&gt;It works. It’s been running fine. In Python.&lt;/p&gt;

&lt;p&gt;What’s the problem? It’s &lt;em&gt;single-threaded&lt;/em&gt;. One message at a time. Send it three messages back-to-back, and they queue up like a Saturday morning line at Walmart. It uses about 50MB of RAM to essentially shuffle JSON between APIs. And its error handling is the type you’re embarrassed about: &lt;code&gt;return f"Error: {str(e)}"&lt;/code&gt; scattered all over.&lt;/p&gt;

&lt;p&gt;To put it bluntly: it works, but it’s a giant hack. Perfect candidate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Rust (besides the meme)?
&lt;/h2&gt;

&lt;p&gt;I could fix it in Python. I could dial up the &lt;code&gt;asyncio&lt;/code&gt;, tighten up error-handling with custom exceptions, and optimize memory. The sane option.&lt;/p&gt;

&lt;p&gt;But sane doesn’t give me a &lt;em&gt;test bench&lt;/em&gt; for adversarial development. Refactoring in Python lacks an external source of truth—the "before" and "after" would share language, libraries, and the LLM’s biases. A port to a different language? That’s different. If Rust’s output differs from Python’s for the same input, somebody’s lying. And that’s exactly the kind of verification I want to test.&lt;/p&gt;

&lt;p&gt;Plus, Rust comes with properties that make the experiment more interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The compiler as a first line of defense.&lt;/strong&gt; Nulls, type mismatches, data races—entire categories of bugs that might silently creep into Python won’t even compile in Rust. How many LLM hallucinations can the compiler block before they hit a test? I want to measure that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True concurrency.&lt;/strong&gt; &lt;code&gt;tokio&lt;/code&gt; allows one &lt;code&gt;spawn&lt;/code&gt; per conversation. In Python, that’s a pain. This is the one functional improvement that really justifies the port.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static binaries.&lt;/strong&gt; A 10MB executable instead of a &lt;code&gt;pip install&lt;/code&gt; with 47 dependencies. That’s a win for distribution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It’s cool.&lt;/strong&gt; Not technically a reason, but I don’t care.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The adventure (and the invite)
&lt;/h2&gt;

&lt;p&gt;RustyClaw—that’s the port’s name—is going to be a publicly documented experiment. Each module I port will be its own blog post. With real data: how many tokens used, cost, how often the AI hallucinated, and how long I fought with the &lt;em&gt;borrow checker&lt;/em&gt;. No sugarcoating.&lt;/p&gt;

&lt;p&gt;If I spend 3 hours on something I could have done in Python in 10 minutes, I’ll admit it. If the LLM invents a non-existent &lt;em&gt;crate&lt;/em&gt; (spoiler: it will), I’ll detail it. If I realize at the end this port wasn’t worth it, I’ll confess to that too.&lt;/p&gt;

&lt;p&gt;Everyone says, "I used AI to write code." No one publishes how much it cost, how often it lied to them, or if the code held up in production. That’s exactly what I’m going to do.&lt;/p&gt;

&lt;p&gt;And I want you to come along for the ride. Because this is going to be an adventure—filled with compiler battles, "WHY WON’T THIS COMPILE IT’S OBVIOUS" moments, and small victories when a differential test passes green. It’s going to be fun. Or, at the very least, honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack (cheat sheet)
&lt;/h2&gt;

&lt;p&gt;If you’re a Pythonista, the left column will look familiar. If you’re a Rustacean, the right. If you’re neither, welcome to the chaos.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Python (nanobot)&lt;/th&gt;
&lt;th&gt;Rust (rustyclaw)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Async runtime&lt;/td&gt;
&lt;td&gt;&lt;code&gt;asyncio&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;tokio&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP&lt;/td&gt;
&lt;td&gt;&lt;code&gt;httpx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;reqwest&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM routing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;litellm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Nonexistent&lt;/strong&gt; — custom router&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Telegram&lt;/td&gt;
&lt;td&gt;&lt;code&gt;python-telegram-bot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;teloxide&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Discord&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;websockets&lt;/code&gt; (raw)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;tokio-tungstenite&lt;/code&gt; (raw)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pydantic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;serde&lt;/code&gt; + &lt;code&gt;figment&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;&lt;code&gt;typer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;clap&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Errors&lt;/td&gt;
&lt;td&gt;&lt;code&gt;str(e)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;anyhow&lt;/code&gt; + &lt;code&gt;thiserror&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;&lt;code&gt;loguru&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;tracing&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI copilot&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Claude Code + Codex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task runner&lt;/td&gt;
&lt;td&gt;&lt;code&gt;make&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;just&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Issue tracker&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;linear&lt;/code&gt; CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The row that hurts most is LiteLLM. In Python, it routes 100+ LLM providers in a single call. Nothing comes close in Rust. I’ll need to roll my own router. The upside? About 80% of LLM providers conform to OpenAI’s API, so between &lt;code&gt;async-openai&lt;/code&gt; + a custom base URL, most use-cases are covered. Anthropic will need its own implementation.&lt;/p&gt;
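
&lt;p&gt;The routing idea, sketched in Python for brevity (the port itself will be Rust, and every URL and name below is an assumption, not the final design):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: most providers speak the OpenAI wire format, so routing is
# mostly "pick a base URL". Anthropic gets its own code path.
OPENAI_COMPATIBLE = {
    "openai":   "https://api.openai.com/v1",
    "deepseek": "https://api.deepseek.com/v1",
}

def route(provider):
    if provider in OPENAI_COMPATIBLE:
        return ("openai-format", OPENAI_COMPATIBLE[provider])
    if provider == "anthropic":
        return ("anthropic-format", "https://api.anthropic.com")
    raise ValueError(f"unknown provider: {provider}")

print(route("deepseek"))  # ('openai-format', 'https://api.deepseek.com/v1')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;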

&lt;p&gt;Around 300 lines of Rust. Sounds manageable. &lt;em&gt;Sounds.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-hallucination strategy (the serious bit)
&lt;/h2&gt;

&lt;p&gt;This is where the adversarial development theory meets reality. An LLM assisting in a port this size is a machine for plausibly inventing things. &lt;/p&gt;

&lt;p&gt;The top risk isn’t that the code won’t compile—Rust doesn’t let garbage compile. The risk is that it compiles, passes tests, and silently does the wrong thing. Exactly the &lt;em&gt;silent failure&lt;/em&gt; I wrote about two weeks ago.&lt;/p&gt;

&lt;p&gt;Five layers of defense:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Rust’s compiler.&lt;/strong&gt; Eliminates nulls, type mismatches, and data races. First free line of defense. But just because it compiles doesn’t make it right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Differential tests.&lt;/strong&gt; Same input → Python nanobot → output. Same input → RustyClaw → output. If they don’t match, something’s off. The Python code is the source of truth. This is the backbone of the experiment.&lt;/p&gt;
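
&lt;p&gt;The harness can be embarrassingly simple. A sketch, assuming both versions expose a CLI that reads stdin and writes stdout (file names are mine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import subprocess

def differential(case):
    # Source of truth: the working Python agent.
    expected = subprocess.run(["python", "nanobot.py"], input=case,
                              capture_output=True, text=True).stdout
    # Candidate: the Rust port.
    actual = subprocess.run(["./rustyclaw"], input=case,
                            capture_output=True, text=True).stdout
    return expected == actual

for case in ["ping", "schedule a cron task", "read file notes.txt"]:
    assert differential(case), f"divergence on: {case!r}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;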

&lt;p&gt;&lt;strong&gt;3. Provenance tracking.&lt;/strong&gt; Each ported file gets a header with its original Python source, LLM session, and test differential results. Total traceability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Crate verification.&lt;/strong&gt; Every crate suggested by the LLM → manually verify on crates.io and docs.rs. LLMs will confidently propose non-existent crates and APIs that just don’t work.&lt;/p&gt;
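
&lt;p&gt;Even that check can be semi-automated. A sketch against crates.io’s public API (the endpoint is real; the workflow is mine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import urllib.error
import urllib.request

def crate_exists(name):
    url = f"https://crates.io/api/v1/crates/{name}"
    req = urllib.request.Request(url, headers={"User-Agent": "rustyclaw-check"})
    try:
        urllib.request.urlopen(req)
        return True
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False  # the crate does not exist; the LLM invented it
        raise

print(crate_exists("tokio"))  # True: real crate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;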

&lt;p&gt;&lt;strong&gt;5. Incident logging.&lt;/strong&gt; Every detected hallucination → an issue logged with a &lt;code&gt;hallucination&lt;/code&gt; label. Material for posts and lessons learned.&lt;/p&gt;

&lt;p&gt;The golden rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The verification system must be external to the generator.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the LLM writes the code, the tests, and the fixtures, you’re validating fiction with fiction. Differential testing against the original Python code naturally breaks the cycle and makes the port inherently verifiable.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;Does it matter?&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;So, the uncomfortable question—does porting this to Rust even matter?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Rust (estimated)&lt;/th&gt;
&lt;th&gt;Does it matter?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Response latency&lt;/td&gt;
&lt;td&gt;~200ms overhead&lt;/td&gt;
&lt;td&gt;~5ms overhead&lt;/td&gt;
&lt;td&gt;No. The LLM takes 2-5 seconds anyway.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;~50MB&lt;/td&gt;
&lt;td&gt;~5MB&lt;/td&gt;
&lt;td&gt;No. My server has 8GB.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency&lt;/td&gt;
&lt;td&gt;1 message at a time&lt;/td&gt;
&lt;td&gt;N messages in parallel&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Startup time&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;td&gt;~50ms&lt;/td&gt;
&lt;td&gt;Meh.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip install&lt;/code&gt; + 47 deps&lt;/td&gt;
&lt;td&gt;Single executable&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type safety&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;str(e)&lt;/code&gt; everywhere&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Result&amp;lt;T, E&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The cool factor&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Subjective.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three out of seven. Four, if we’re being generous. The latency and RAM improvements are meaningless since the bottleneck is always the LLM call. Concurrency matters for multiple users. A static binary is a real upgrade. And the type safety? After seeing how many bugs &lt;code&gt;str(e)&lt;/code&gt; lets fly under the radar for months, yeah, that matters.&lt;/p&gt;

&lt;p&gt;Does it justify weeks of work? As a standalone port, probably not. As a testbed for adversarial development with published real-world data? I think yes. By the end of this series, we’ll have hard numbers—not opinions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The raw numbers
&lt;/h2&gt;

&lt;p&gt;Every work session will be logged in a public CSV in the repo:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
csv
date,llm,model,module,tokens_in,tokens_out,cost_usd,duration_min,loc_python,loc_rust,hallucinations,tests_pass
---

Which LLM I used, tokens consumed, cost, duration, lines ported, hallucinations detected, tests passed. It’ll all be public. All verifiable.

At the end of this series, anyone will be able to sum up `cost_usd` and decide if RIIR was worth it. Anyone will be able to count hallucinations and decide if adversarial development works or is just hype. Spoiler: I have no idea what the numbers will be. And that’s what makes it interesting.

## Join me

- **Repo:** [github.com/frr149/rustyclaw](https://github.com/frr149/rustyclaw)—code, issues, tracking
- **Blog:** Each phase will have its own post here in the *RustyClaw: Rewrite It In Rust* series
- **Backlog:** Public on Linear, visible via GitHub issues

What’s better than an AGI? An AGI rewritten in Rust. Just ask the meme. Now let’s prove it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>rust</category>
      <category>python</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why 99% of What You Send to Claude Is Already Cached</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:16:48 +0000</pubDate>
      <link>https://dev.to/frr149/why-99-of-what-you-send-to-claude-is-already-cached-mb9</link>
      <guid>https://dev.to/frr149/why-99-of-what-you-send-to-claude-is-already-cached-mb9</guid>
      <description>&lt;p&gt;I'm building an app that monitors my token consumption in Claude Code. A few days ago, looking at the raw numbers, I found this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cacheReadInputTokens:     4,241,579,174
inputTokens:                  1,293,019
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four billion two hundred forty-one million tokens read from cache. One million three hundred thousand "fresh" tokens. That's a &lt;strong&gt;99.97%&lt;/strong&gt; cache hit rate.&lt;/p&gt;

&lt;p&gt;My first reaction was thinking something was broken. Nobody has a 99% cache hit rate. Not Redis. Not Cloudflare. Not your mom when she claims she already knows what you're going to ask for dinner.&lt;/p&gt;

&lt;p&gt;But it turns out it's not broken. This is exactly how it works. And the reason is as elegant as it is counterintuitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Gets Cached Isn't Text
&lt;/h2&gt;

&lt;p&gt;This is where most explanations fall short. When you read "prompt caching," you think of something like Redis: store the question, store the answer, if someone asks the same question, return the same answer.&lt;/p&gt;

&lt;p&gt;Not at all.&lt;/p&gt;

&lt;p&gt;What gets cached are &lt;strong&gt;KV tensors&lt;/strong&gt; — the Key and Value matrices that the transformer calculates during the prefill phase. In simpler terms: when an LLM receives your prompt, the first thing it does is convert all that text into internal numerical representations (embeddings) and multiply them by weight matrices to get the "keys" (K) and "values" (V) that the attention mechanism needs to generate the response.&lt;/p&gt;

&lt;p&gt;That calculation is &lt;strong&gt;expensive&lt;/strong&gt;. In a 200,000-token prompt (normal for Claude Code, where conversation history accumulates), we're talking about billions of matrix multiplication operations. It's the most GPU-intensive part, the slowest part, the most expensive part.&lt;/p&gt;

&lt;p&gt;The key insight: between one of your messages and the next, 99% of that prompt &lt;strong&gt;doesn't change&lt;/strong&gt;. The system prompt is identical. The previous conversation history is identical. The files it read are the same. The only new thing is your latest message.&lt;/p&gt;

&lt;p&gt;Why recalculate what you already calculated 30 seconds ago?&lt;/p&gt;
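
&lt;p&gt;A toy, single-layer illustration of why that reuse is safe: in a causal model, the K/V projections of a prefix don't depend on anything that comes after it. Shapes and numbers here are made up:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

d = 64                                    # toy model dimension
rng = np.random.default_rng(0)
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

history = rng.normal(size=(1000, d))      # embeddings already processed last turn
K_hist, V_hist = history @ Wk, history @ Wv   # this is what gets cached

new_msg = rng.normal(size=(5, d))         # only the new message needs prefill
K = np.vstack([K_hist, new_msg @ Wk])     # identical to recomputing from scratch,
V = np.vstack([V_hist, new_msg @ Wv])     # at roughly 0.5% of the matrix work
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;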

&lt;h2&gt;
  
  
  How Matching Works
&lt;/h2&gt;

&lt;p&gt;Caching isn't enough. You need to know when the cache is valid. Anthropic uses an elegant trick: &lt;strong&gt;cumulative prefix hashing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Each block of the prompt (system, tools, messages) generates a hash. But not an individual hash: a &lt;em&gt;cumulative&lt;/em&gt; hash. The hash of block 3 includes the content of blocks 1, 2, and 3. If anything changes in a previous block, the hash of all following blocks changes too.&lt;/p&gt;

&lt;p&gt;When a new request arrives, the system searches backwards from the point marked with &lt;code&gt;cache_control&lt;/code&gt;, comparing hashes block by block, until it finds the &lt;strong&gt;longest matching prefix&lt;/strong&gt;. Everything that matches → read from cache. Only the new stuff → gets calculated.&lt;/p&gt;

&lt;p&gt;It's like a movie you've seen 40 times. You don't need to watch the whole thing to know what happens. You only need to watch from the point where it differs from what you remember.&lt;/p&gt;

&lt;p&gt;The system only checks up to 20 blocks backwards. Beyond that, it stops searching. This is a practical decision to avoid spending more time searching the cache than calculating tensors directly.&lt;/p&gt;
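
&lt;p&gt;A minimal sketch of cumulative prefix hashing, with the bookkeeping simplified:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import hashlib

def prefix_hashes(blocks):
    # Each hash commits to everything before it: change block 1
    # and every later hash changes too.
    h, out = hashlib.sha256(), []
    for block in blocks:
        h.update(block.encode())
        out.append(h.copy().hexdigest())
    return out

cached = prefix_hashes(["system prompt", "tools", "msg 1", "reply 1"])
incoming = prefix_hashes(["system prompt", "tools", "msg 1", "reply 1", "msg 2"])

# Longest matching prefix = everything served from cache
hits = sum(1 for a, b in zip(cached, incoming) if a == b)
print(f"{hits} of {len(incoming)} blocks are cache hits")  # 4 of 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;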

&lt;h2&gt;
  
  
  Why Claude Code Has a 99% Cache Hit Rate
&lt;/h2&gt;

&lt;p&gt;Now that you know how matching works, the 99% stops being mysterious. Look at what happens in a typical Claude Code session:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message 1&lt;/strong&gt; (first in the session):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System prompt (8K tokens) + Tools (2K tokens) + Your message (500 tokens)
= 10,500 tokens → EVERYTHING calculated, EVERYTHING written to cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Message 2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System prompt (8K) + Tools (2K) + Message 1 (500) + Response 1 (3K) + Your message 2 (500)
= 14,000 tokens
→ First 10,500 → CACHE HIT (already calculated before)
→ The 3,500 new ones → calculated and added to cache
Cache hit: 75%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Message 10:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System prompt + Tools + 9 messages + 9 responses + Your message 10
= ~150,000 tokens
→ First ~149,500 → CACHE HIT
→ The ~500 new ones → calculated
Cache hit: 99.7%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See it? The conversation history &lt;strong&gt;only grows&lt;/strong&gt;. Each new message is a tiny fraction of the accumulated total. The cache ratio converges to 99% with the certainty of a natural logarithm.&lt;/p&gt;

&lt;p&gt;It's not magic. It's arithmetic: the numerator (new tokens per message) stays roughly constant, while the denominator (the accumulated context) grows with every exchange. The ratio can only climb toward 100%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Those Tensors Live
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Because caching KV tensors isn't like caching strings in Redis. We're talking about &lt;strong&gt;gigabytes of numerical data&lt;/strong&gt; that need to be available with microsecond latency.&lt;/p&gt;

&lt;p&gt;Anthropic uses a two-level system:&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1: VRAM (5-minute TTL)
&lt;/h3&gt;

&lt;p&gt;The tensors live directly in the &lt;strong&gt;GPU memory&lt;/strong&gt; that will serve the next request. Zero copy, zero network latency. Cache hits are nearly instantaneous because the data is already where it's needed.&lt;/p&gt;

&lt;p&gt;TTL: 5 minutes. If nobody makes a request in 5 minutes, they get evicted. This is the cache you use with the standard API. Cache write price: 1.25x normal input price.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 2: GPU Node SSD (1-hour TTL)
&lt;/h3&gt;

&lt;p&gt;If you pay for extended cache write (2x input price), tensors don't get evicted after 5 minutes. Instead, when they leave VRAM due to memory pressure, they get &lt;strong&gt;offloaded to the local SSD of the GPU node&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a cache hit comes in, they're reloaded from SSD to VRAM. Slower than level 1, but infinitely faster than recalculating tensors from scratch.&lt;/p&gt;

&lt;p&gt;The interesting part: &lt;strong&gt;no network involved&lt;/strong&gt;. It's not a remote Redis. It's not S3. It's an SSD physically attached to the server that has the GPU. The architecture is designed to minimize data movement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request → In VRAM? → Yes → Instant cache hit
                   → No → In local SSD? → Yes → Load to VRAM → Cache hit (~ms)
                                        → No → Calculate KV tensors → Cache miss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since February 2026, isolation is &lt;strong&gt;per workspace&lt;/strong&gt; (previously per organization). This means tensors from your development team don't mix with the marketing team's, even if they're in the same Anthropic organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;If you're evaluating whether this matters for your use case, here are the hard facts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cache read&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.1x&lt;/strong&gt; input price (90% discount)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache write 5 min&lt;/td&gt;
&lt;td&gt;1.25x input price&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache write 1 hour&lt;/td&gt;
&lt;td&gt;2x input price&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency reduction&lt;/td&gt;
&lt;td&gt;~85% on long prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minimum cacheable&lt;/td&gt;
&lt;td&gt;1,024 tokens per checkpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With Sonnet, input costs $3.00/M tokens. A cache read costs $0.30/M. In a Claude Code session with 200K tokens of history, the difference between recalculating and reading from cache is the difference between $0.60 and $0.06 &lt;strong&gt;per message&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Multiply that by the hundreds of messages you might exchange in a long session and you understand why Anthropic invested in building this: without prompt caching, long conversations with huge context would be economically unfeasible.&lt;/p&gt;
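
&lt;p&gt;The arithmetic, spelled out (prices from the table above; the session size is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;history_tokens = 200_000            # assumed long-session context
input_price    = 3.00 / 1_000_000   # $/token, Sonnet input
cache_read     = 0.30 / 1_000_000   # $/token, the 0.1x discount

print(f"without cache: ${history_tokens * input_price:.2f} per message")  # $0.60
print(f"with cache:    ${history_tokens * cache_read:.2f} per message")   # $0.06
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;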

&lt;h2&gt;
  
  
  My Real Data
&lt;/h2&gt;

&lt;p&gt;Back to my numbers from the beginning. In my Claude Code usage over a month:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cacheReadInputTokens:       4,241,579,174  (4.2 billion — read from cache)
cacheCreationInputTokens:     196,596,243  (197 million — written to cache)
inputTokens:                    1,293,019  (1.3 million — calculated without cache)
outputTokens:                   2,517,666  (2.5 million — generated by the model)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Global cache hit rate: 95.5%&lt;/strong&gt;. And within individual long sessions, it easily exceeds 99%.&lt;/p&gt;

&lt;p&gt;Notice the asymmetry: I've read 4.2 billion tokens from cache, but the model has only &lt;em&gt;generated&lt;/em&gt; 2.5 million tokens of output. The cache-read to actual-work ratio is &lt;strong&gt;1,685:1&lt;/strong&gt;. For every token the model produces, it reuses 1,685 tokens of previous context.&lt;/p&gt;

&lt;p&gt;This also means &lt;code&gt;cacheReadInputTokens&lt;/code&gt; &lt;strong&gt;isn't a good productivity metric&lt;/strong&gt;. It doesn't measure how much you've "used" the model. It measures how much history the model has &lt;em&gt;reread&lt;/em&gt;. It's like measuring your productivity by how many times you've opened the same file in your editor.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Anthropic Doesn't Tell You
&lt;/h2&gt;

&lt;p&gt;There are things that aren't public:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User→GPU affinity&lt;/strong&gt;: How do they ensure your next request lands on the same node that has your cache? Probably sticky routing per session, but they don't confirm it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSD type&lt;/strong&gt;: NVMe? CXL-attached? KV tensors for a 200K token prompt take up several GB. SSD speed matters a lot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PagedAttention&lt;/strong&gt;: vLLM (the most popular open-source serving engine) uses a technique called PagedAttention that manages KV tensors like virtual memory pages. Does Anthropic use something similar, or do they have something proprietary? Unknown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster topology&lt;/strong&gt;: How many GPUs, how they're interconnected, whether they use InfiniBand or Ethernet. Nothing public.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Analogy That Explains Everything
&lt;/h2&gt;

&lt;p&gt;Think of prompt caching as a surgeon's working memory during an operation.&lt;/p&gt;

&lt;p&gt;The surgeon (the model) has to process all the patient information (the prompt) to decide each move (the output). Without cache, they'd have to reread the complete medical history before each cut. With cache, they remember everything they already read and only need to process new information — the latest blood work, the tissue's response to the previous cut.&lt;/p&gt;

&lt;p&gt;What gets saved isn't the patient's documents (the text). It's the &lt;strong&gt;intermediate conclusions&lt;/strong&gt; the surgeon already extracted from those documents (the KV tensors). They don't need to reread the blood work. They already know what it says. They just need to integrate the new information with what they already know.&lt;/p&gt;

&lt;p&gt;The 99% cache hit rate simply reflects that, in a conversation with an LLM, the amount of "what we already know" grows much faster than the amount of "new stuff to process."&lt;/p&gt;

&lt;p&gt;And that's what makes it possible to have 200K token context conversations without each message costing you an arm and a leg.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; If you're interested in what happens when the app monitoring those tokens is based on data invented by the AI itself, read &lt;a href="https://dev.to/silent-failure-ai-invents-tests-say-fine/"&gt;Silent failure: when your AI makes things up and tests say everything's fine&lt;/a&gt;. And if you want to see how I manage API secrets without 1Password asking for Touch ID every 30 seconds, &lt;a href="https://dev.to/authorization-fatigue-1password-cache/"&gt;authorization fatigue and a 40-line cache&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>OpenAI scales PostgreSQL for 800 million users with a single writer (no sharding)</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:14:46 +0000</pubDate>
      <link>https://dev.to/frr149/openai-scales-postgresql-for-800-million-users-with-a-single-writer-no-sharding-3ld0</link>
      <guid>https://dev.to/frr149/openai-scales-postgresql-for-800-million-users-with-a-single-writer-no-sharding-3ld0</guid>
      <description>&lt;p&gt;Every time an article comes out about a large company's infrastructure, half the Hacker News comments are variations of "of course they use Kubernetes with 47 microservices and a distributed database with custom consensus protocol." And when it turns out they don't—that they use plain PostgreSQL with a single &lt;em&gt;primary&lt;/em&gt; and discipline—there's an uncomfortable silence.&lt;/p&gt;

&lt;p&gt;That just happened with OpenAI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers nobody expected
&lt;/h2&gt;

&lt;p&gt;Bohan Zhang, infrastructure engineer at OpenAI, published details about how they scale PostgreSQL for ChatGPT. The numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;800 million users&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A single PostgreSQL &lt;em&gt;primary&lt;/em&gt;&lt;/strong&gt; (writer) on Azure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~50 &lt;em&gt;read replicas&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Millions of queries per second&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;p99 of 10-19ms&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;99.999% availability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One SEV-0 in a year&lt;/strong&gt; (and that was from ImageGen's viral launch, which added 100 million new users in a week)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read that again. One. Single. Writer. For 800 million users.&lt;/p&gt;

&lt;h2&gt;
  
  
  "But they should shard"
&lt;/h2&gt;

&lt;p&gt;No. And the reason is brutally pragmatic.&lt;/p&gt;

&lt;p&gt;Sharding PostgreSQL would have required modifying &lt;strong&gt;hundreds of endpoints&lt;/strong&gt; in the application. Every query that assumes all data lives in the same database—which is practically all of them—would need to be rewritten to know which shard contains each piece of data.&lt;/p&gt;

&lt;p&gt;The cost of that migration? Months of engineering work, new bugs at every corner, and a transition period where you maintain both systems.&lt;/p&gt;

&lt;p&gt;What they did instead? They identified the heaviest &lt;em&gt;writes&lt;/em&gt; and moved them to Cosmos DB. Not because Cosmos is better than PostgreSQL, but because those specific &lt;em&gt;workloads&lt;/em&gt; fit better in a document model. The rest—the vast majority of business logic—stayed in PostgreSQL.&lt;/p&gt;

&lt;p&gt;Instead of complicating the entire system, they isolated the problem and solved it where it hurt. Surgery with a scalpel, not a chainsaw.&lt;/p&gt;

&lt;h2&gt;
  
  
  PgBouncer: from 50ms to 5ms per connection
&lt;/h2&gt;

&lt;p&gt;One of the first bottlenecks they found was connection establishment latency. PostgreSQL creates a process for each new connection. With thousands of simultaneous connections from hundreds of application pods, the connection &lt;em&gt;overhead&lt;/em&gt; consumed 50ms before executing a single query.&lt;/p&gt;

&lt;p&gt;The solution: PgBouncer as a &lt;em&gt;connection pooler&lt;/em&gt;. It maintains a pool of already-established connections and reuses them. Result: connection latency dropped to 5ms. 90% less, by changing a piece of plumbing.&lt;/p&gt;

&lt;p&gt;It's not new technology. PgBouncer has been in production at companies of all sizes for over 15 years. But there it is: a battle-tested, boring tool solving a problem in one of the most-used applications on the planet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ORM that did 12-table joins
&lt;/h2&gt;

&lt;p&gt;This is my favorite. Because I've seen it in my students' projects, in startups, in banks. Everywhere.&lt;/p&gt;

&lt;p&gt;The ORM generated queries with &lt;em&gt;joins&lt;/em&gt; across 12 tables. Not because someone designed it that way, but because the models were related to each other and the ORM, obediently, followed the relationships to the end.&lt;/p&gt;

&lt;p&gt;The solution wasn't changing ORMs or switching to manual queries for everything. It was &lt;strong&gt;moving logic to the application&lt;/strong&gt;. Instead of asking PostgreSQL to do a monstrous &lt;em&gt;join&lt;/em&gt;, they made several simpler queries and assembled the data in code.&lt;/p&gt;

&lt;p&gt;Is that less elegant? Yes. Is it faster? Enormously. Because PostgreSQL can optimize simple queries much better than a 12-table &lt;em&gt;join&lt;/em&gt; with cross conditions. And because you can cache partial results and reuse them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- BEFORE: the ORM generates this&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;profiles&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;teams&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="c1"&gt;-- 12 tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- AFTER: separate queries, logic in application&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;profiles&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- cacheable, parallelizable, debuggeable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each individual query is trivial. The &lt;em&gt;query planner&lt;/em&gt; executes them in microseconds. And if one fails or runs slow, you know exactly which one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The defenses nobody sees
&lt;/h2&gt;

&lt;p&gt;What I find brilliant about Bohan Zhang's article isn't the big numbers, but the small defenses that prevent everything from falling apart:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;If a transaction sits open doing nothing, PostgreSQL kills it after a configurable time. Why does this matter? Because an open transaction &lt;strong&gt;blocks &lt;em&gt;autovacuum&lt;/em&gt;&lt;/strong&gt;. And without &lt;em&gt;autovacuum&lt;/em&gt;, tables bloat, indexes degrade, and eventually your database gets slower every day.&lt;/p&gt;

&lt;p&gt;It's like leaving the fridge door open. Nothing happens for the first 5 minutes. But if you forget it all night, the next day everything is at room temperature.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema changes with 5-second timeout
&lt;/h3&gt;

&lt;p&gt;When you do an &lt;code&gt;ALTER TABLE&lt;/code&gt; in PostgreSQL, you need a &lt;em&gt;lock&lt;/em&gt; on the table. If there are long transactions running, that &lt;em&gt;lock&lt;/em&gt; waits. And while it waits, &lt;strong&gt;it blocks all new queries&lt;/strong&gt;. A schema migration that takes 200ms can bring down your database if there's an old transaction that won't finish.&lt;/p&gt;

&lt;p&gt;OpenAI's solution: &lt;code&gt;SET lock_timeout = '5s'&lt;/code&gt;. If the migration can't get the &lt;em&gt;lock&lt;/em&gt; in 5 seconds, it aborts. Better to fail fast and retry than block the entire system waiting.&lt;/p&gt;
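&lt;p&gt;A minimal sketch of what that looks like in a migration (table and column are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;
-- Abort if the lock isn't acquired within 5 seconds, instead of
-- queueing behind an old transaction and blocking every new query.
SET LOCAL lock_timeout = '5s';
ALTER TABLE users ADD COLUMN last_seen_at timestamptz;
COMMIT;
-- On failure: back off and retry later. Failing fast is the point.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;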

&lt;h3&gt;
  
  
  Rate limiting in 4 layers
&lt;/h3&gt;

&lt;p&gt;Not one. Not two. Four layers of &lt;em&gt;rate limiting&lt;/em&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Edge/CDN&lt;/strong&gt; — blocking abusive traffic before it reaches the application&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API gateway&lt;/strong&gt; — limits per user/API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application&lt;/strong&gt; — limits per operation type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt; — &lt;em&gt;connection limits&lt;/em&gt; and &lt;em&gt;statement timeouts&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each layer catches what the previous one lets through. Defense in depth. The same onion philosophy I apply for &lt;a href="https://dev.to/es/cinco-defensas-alucinaciones-codigo/"&gt;defenses against hallucinations&lt;/a&gt;, but for infrastructure.&lt;/p&gt;
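&lt;p&gt;Of the four, only the database layer is expressible in plain SQL. A hedged sketch (role name and numbers are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Layer 4: the database defends itself even if layers 1-3 fail.
ALTER ROLE api_user CONNECTION LIMIT 200;
ALTER ROLE api_user SET statement_timeout = '5s';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;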

&lt;h3&gt;
  
  
  Workload isolation by priority
&lt;/h3&gt;

&lt;p&gt;Not all queries are equal. A query for "show user's chat" is critical—if it fails, the user sees an error. A query for "generate analytics report" is important, but can wait 30 seconds.&lt;/p&gt;

&lt;p&gt;OpenAI routes queries by priority to different &lt;em&gt;read replicas&lt;/em&gt;. High-priority replicas have less load and respond faster. Low-priority ones can run hotter without affecting user experience.&lt;/p&gt;

&lt;p&gt;It's common sense, but requires discipline. You have to classify each query, configure routing, and resist the temptation to send everything to the fast replica "because it's just one more query."&lt;/p&gt;

&lt;h2&gt;
  
  
  Backfills that take weeks
&lt;/h2&gt;

&lt;p&gt;When you need to populate a new column for 800 million users, you can't do &lt;code&gt;UPDATE users SET new_column = computed_value&lt;/code&gt;. That would lock the table, saturate the disk, and probably bring down the &lt;em&gt;primary&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;At OpenAI, &lt;em&gt;backfills&lt;/em&gt; run with strict &lt;em&gt;rate limiting&lt;/em&gt;. Weeks. A backfill that takes weeks.&lt;/p&gt;

&lt;p&gt;Sound horrible? It's the opposite. It's the decision of a team that understands backfill speed is irrelevant compared to system stability. Better to take 3 weeks with nobody noticing than take 3 hours and have a SEV-0 at 2 AM.&lt;/p&gt;
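&lt;p&gt;The shape of such a backfill is old and boring: batch the &lt;code&gt;UPDATE&lt;/code&gt; and throttle between batches. A sketch, reusing the placeholder &lt;code&gt;computed_value&lt;/code&gt; from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One small batch: bounded lock time, bounded WAL, bounded replica lag.
UPDATE users
SET new_column = computed_value
WHERE id IN (
  SELECT id FROM users
  WHERE new_column IS NULL
  LIMIT 1000
);
-- A driver script sleeps between batches and repeats until 0 rows are
-- updated. Slow on purpose: weeks, and nobody notices.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;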

&lt;h2&gt;
  
  
  The cascading replication that's coming
&lt;/h2&gt;

&lt;p&gt;Currently they have ~50 replicas connected directly to the &lt;em&gt;primary&lt;/em&gt;. Each replica consumes a replication connection and bandwidth from the &lt;em&gt;primary&lt;/em&gt;. With 50 it's manageable. With 100+ it would be a problem.&lt;/p&gt;

&lt;p&gt;The solution they're developing: &lt;strong&gt;cascading replication&lt;/strong&gt;. Replicas that replicate from other replicas, not from the &lt;em&gt;primary&lt;/em&gt;. A tree instead of a star. The &lt;em&gt;primary&lt;/em&gt; sends data to 5-10 first-level replicas, and those replicas feed the rest.&lt;/p&gt;

&lt;p&gt;It's the same idea as BitTorrent. Instead of everyone downloading from the same server, nodes share with each other. Works for pirated movies, works for WAL segments.&lt;/p&gt;
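&lt;p&gt;Mechanically, PostgreSQL supports this out of the box: a standby can stream from another standby. A sketch of a second-tier replica's configuration (hostnames invented):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# postgresql.conf on a second-tier replica: point it at a
# first-tier replica instead of the primary (plus the usual standby.signal).
primary_conninfo = 'host=replica-tier1-03 port=5432 user=replicator'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;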

&lt;h2&gt;
  
  
  The lesson nobody wants to hear
&lt;/h2&gt;

&lt;p&gt;The industry has an addiction to &lt;em&gt;over-engineering&lt;/em&gt;. Every week a new database comes out promising to solve problems most companies don't have. And every week, engineering teams adopt those technologies because they "scale better" or "are more modern," without asking whether PostgreSQL with a bit of discipline would do the job.&lt;/p&gt;

&lt;p&gt;OpenAI—the company defining the future of AI, with one of the fastest-growing products in history—uses PostgreSQL. With a single &lt;em&gt;primary&lt;/em&gt;. No sharding. No exotic distributed database.&lt;/p&gt;

&lt;p&gt;They use PgBouncer (2007). Read replicas (concept from the 90s). &lt;em&gt;Connection pooling&lt;/em&gt; (as old as relational databases). &lt;em&gt;Rate limiting&lt;/em&gt; (invented before most of us were born).&lt;/p&gt;

&lt;p&gt;The magic isn't in the technology. It's in the discipline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple queries instead of monstrous &lt;em&gt;joins&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Aggressive timeouts instead of infinite waits&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Workload&lt;/em&gt; isolation instead of "everything on the same server"&lt;/li&gt;
&lt;li&gt;Migrate only what needs migrating, don't rewrite everything&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  For your next standup
&lt;/h2&gt;

&lt;p&gt;The next time someone on your team proposes migrating to a distributed database, or sharding PostgreSQL, or adding a queue service between the API and database "because it won't scale," show them these numbers.&lt;/p&gt;

&lt;p&gt;800 million users. One &lt;em&gt;primary&lt;/em&gt;. p99 of 10-19ms. 99.999% uptime.&lt;/p&gt;

&lt;p&gt;And ask: "Is our problem really that PostgreSQL doesn't scale? Or is it that our queries are a mess?"&lt;/p&gt;

&lt;p&gt;Because it's almost always the second one.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Source:&lt;/strong&gt; &lt;a href="https://blog.openai.com/" rel="noopener noreferrer"&gt;Inside the Postgres Setup Powering 800M ChatGPT Users&lt;/a&gt; — Bohan Zhang, OpenAI. If you read only one infrastructure article this year, make it this one.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>scalability</category>
      <category>infrastructure</category>
      <category>openai</category>
    </item>
    <item>
      <title>Madness Driven Design: Don Quixote, Sancho Panza, and Your AI Copilot</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:11:43 +0000</pubDate>
      <link>https://dev.to/frr149/madness-driven-design-don-quixote-sancho-panza-and-your-ai-copilot-fhd</link>
      <guid>https://dev.to/frr149/madness-driven-design-don-quixote-sancho-panza-and-your-ai-copilot-fhd</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: An LLM is like Don Quijote—you can't cure his madness, it's stochastic by nature. The solution isn't to fix the madman but to assign him a deterministic Sancho Panza as a sidekick. MDD consists of two layers: first, you study the errors it makes to design tools that absorb those mistakes, and then you let it loose with those tools to verify you've closed any gaps. Design for madness, not against it.&lt;/p&gt;




&lt;p&gt;I spent weeks auditing logs. 165 sessions of an AI agent interacting with a CLI to manage tasks. Over 500 errors. 370 retries. Patterns emerged, repeating over and over: the agent would use &lt;code&gt;--status&lt;/code&gt; when the flag was actually called &lt;code&gt;--state&lt;/code&gt;. It would write &lt;code&gt;Todo&lt;/code&gt; when the API expected &lt;code&gt;unstarted&lt;/code&gt;. It would pass &lt;code&gt;urgent&lt;/code&gt; as a priority when the system only accepted numbers.&lt;/p&gt;

&lt;p&gt;And what fascinated me was that every single error made sense. They weren't random. They were &lt;em&gt;plausible&lt;/em&gt;. Exactly the kind of mistakes you or I would make if we "kind of" understood a domain but had never read the documentation carefully.&lt;/p&gt;

&lt;p&gt;At some point during the audit, staring at yet another &lt;code&gt;--status Done&lt;/code&gt; that should have been &lt;code&gt;--state completed&lt;/code&gt;, I realized I was witnessing a literary pattern. One that is 400 years old.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don Quijote is an LLM
&lt;/h2&gt;

&lt;p&gt;Think about it for a minute. Don Quijote sees windmills and says, "Those are giants." He's not stupid—he's a well-read man, deeply familiar with tales of chivalry. His problem is that his model of the world has been contaminated with fictitious training data. He's read so many tales of knightly adventure that when he encounters something ambiguous, he interprets it according to his &lt;em&gt;training data&lt;/em&gt;: Windmills → giants. Flocks of sheep → armies. Inns → castles.&lt;/p&gt;

&lt;p&gt;An LLM does exactly the same thing. It has seen thousands of APIs during training. When you ask it to use one it doesn't know well, it doesn't say, "I don't know." It guesses. And it guesses well. Most of the time. Well enough that you'll trust it. And when it fails, the failure is &lt;em&gt;plausible&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--status&lt;/code&gt; instead of &lt;code&gt;--state&lt;/code&gt;. Because in 60% of the CLIs it has seen, the flag is called &lt;code&gt;--status&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Todo&lt;/code&gt; instead of &lt;code&gt;unstarted&lt;/code&gt;. Because in the GUI of the tool, the column is labeled "Todo." The LLM has seen screenshots in documentation. It's read blogs. It infers that if the UI says "Todo," the API must accept "Todo." Makes sense. But it's wrong.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;urgent&lt;/code&gt; instead of &lt;code&gt;1&lt;/code&gt;. Because in most priority systems, &lt;code&gt;urgent&lt;/code&gt; is a valid value. Who designs an API where priority is an integer from 1 to 4 instead of labeled options?&lt;/p&gt;

&lt;p&gt;Each hallucination is a reasonable inference based on incomplete data. Don Quijote isn't stupid. He's mad. And you can't cure madness.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Cervantes Already Knew
&lt;/h2&gt;

&lt;p&gt;Cervantes didn't try to cure Don Quijote. What he did was place Sancho Panza by his side.&lt;/p&gt;

&lt;p&gt;Sancho isn't brilliant. He hasn't read any books. He has no grand visions. But he is &lt;em&gt;deterministic&lt;/em&gt;. When Don Quijote says, "Look at those giants," Sancho replies, "Sir, they're windmills." Don Quijote doesn't always listen, but the information is there. The system has two layers: a stochastic one that generates hypotheses (Don Quijote) and a deterministic one that checks them against reality (Sancho).&lt;/p&gt;

&lt;p&gt;That's the architecture you need when working with an LLM. You're not going to stop it from hallucinating—it's in its nature. What you &lt;em&gt;can&lt;/em&gt; do is build deterministic filters to catch those hallucinations before they cause harm.&lt;/p&gt;

&lt;p&gt;And this is where the methodology comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  MDD: Madness Driven Design
&lt;/h2&gt;

&lt;p&gt;MDD has two layers, and the order matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: A Priori Archaeology
&lt;/h3&gt;

&lt;p&gt;Before you write a single line of code, you study the madness. You don’t guess—you observe. You gather real data on how the LLM interacts with existing tools and catalog its errors.&lt;/p&gt;

&lt;p&gt;In my case, I analyzed 165 sessions of an AI agent using a CLI to manage a software development team. The numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error Category&lt;/th&gt;
&lt;th&gt;Occurrences&lt;/th&gt;
&lt;th&gt;Retry Attempts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Invented or invalid flags&lt;/td&gt;
&lt;td&gt;275&lt;/td&gt;
&lt;td&gt;~150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broken JSON/GraphQL escaping&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;80+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Naming confusion&lt;/td&gt;
&lt;td&gt;40+&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Impossible CLI operations&lt;/td&gt;
&lt;td&gt;60+&lt;/td&gt;
&lt;td&gt;90+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verbose output wasting tokens&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Using that data, you design the new tool to &lt;em&gt;absorb&lt;/em&gt; the errors instead of rejecting them. In plain English: the sane adapts to the mad, not the other way around.&lt;/p&gt;

&lt;p&gt;Concrete examples of absorption:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM error               → Tool design
────────────────────────────────────────
--status Done           → --status is an alias for --state
                          Normalize "Done" to "completed"

--priority urgent       → Normalize "urgent" to 1
                          "high" → 2, "medium" → 3, "low" → 4

--no-pager              → Silently ignore flag
                          (the tool never uses a pager)

Broken quote escaping   → Require input via files or stdin
in descriptions           Never inline. Serde handles it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Each row in that table represents a design decision based on a real observed error. Not speculations about "what could go wrong," but logs showing "this wrong thing happened 40 times in 165 sessions."

The difference from conventional design is subtle but important. In normal design, you define the correct interface and reject anything that doesn't fit. In MDD, you define the correct interface _and_ all the likely incorrect interfaces your user will try, and you absorb them.

It's like designing a door that opens both by pushing and pulling. The "correct" door only opens in one direction. The _better_ door opens both ways because you've observed that 40% of people push instead of pulling.

### Layer 2: A Posteriori Verification

You build the tool with the defenses of Layer 1, and then you let it loose. You give the new tool to the LLM and watch what _new_ mistakes it makes.

If Layer 1 was thorough, the new mistakes should be minimal. If new errors appear, you've found gaps in your design. Every new error is an involuntary penetration test.

When I did this with my CLI, the LLM invented things I hadn't seen in the original audit:

- **A sorting enum that didn't exist.** The API allows sorting by `createdAt` and `updatedAt`. The LLM invented a `priority` sorting value. Perfectly logical—why _couldn’t_ you sort by priority? But it doesn't exist in the GraphQL schema.

- **A filtering operator that didn't exist.** To filter by state, the API accepts `state.type.in`. The LLM generated `state.id.or`. Coherent syntax, reasonable pattern, completely fabricated.

- **A file-locking function from another language.** In a Rust project, the LLM suggested `fcntl.flock` for file locking. That's a Python function. In Rust, you'd use the `fs2` crate.

Each of these errors was plausible. None were stupid. And each revealed a gap: the tool didn't validate the sorting enum, didn't reject fake filter operators, and the documentation for the file-locking crate wasn't included in the agent's context.

Layer 2 closes the loop. You don't assume your design is correct—you verify it by unleashing your most creative error-prone tester (the LLM).

## The Sancho Panza Stack

The Don Quijote-Sancho Panza metaphor isn’t just a cute comparison. It’s an architecture. In practice, "Sancho Panza" isn't a single entity—it's a _stack_ of deterministic layers, each one catching a different type of madness:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────┐
│         LLM (Don Quijote)            │  Generates plausible commands
│         Stochastic, creative         │  but potentially incorrect
└──────────────┬───────────────────────┘
               │ "--status Done --priority urgent"
┌──────────────▼───────────────────────┐
│  1. CLI Parser (clap)                │  Rejects flags that don’t exist
│     Accepts aliases: --status→--state│
└──────────────┬───────────────────────┘
               │ "--state Done --priority urgent"
┌──────────────▼───────────────────────┐
│  2. Normalization                    │  Normalize "Done"→"completed",
│     state and priority aliases       │  "urgent"→1
└──────────────┬───────────────────────┘
               │ "--state completed --priority 1"
┌──────────────▼───────────────────────┐
│  3. Validation                       │  Check if "completed" is a valid
│     Against known enums              │  state, if "1" is in range
└──────────────┬───────────────────────┘
               │ state=completed, priority=1
┌──────────────▼───────────────────────┐
│  4. Serialization (serde)            │  Escapes inputs correctly
│     GraphQL variables, no strings    │
│     interpolated                     │
└──────────────┬───────────────────────┘
               │ {"state":"completed","priority":1}
┌──────────────▼───────────────────────┐
│  5. API + Error Handling             │  If the API rejects something,
│     Retry with backoff, actionable   │  returns useful errors
│     messages                         │
└──────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
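&lt;p&gt;To make layers 2 and 3 concrete, here is a minimal Rust sketch. Function names and error messages are illustrative, not the tool's actual code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Illustrative sketch of layers 2-3; not the real tool's source.
// Layer 2: absorb plausible-but-wrong values into canonical ones.
fn normalize_priority(raw: &amp;amp;str) -&amp;gt; Result&amp;lt;u8, String&amp;gt; {
    match raw.to_ascii_lowercase().as_str() {
        "urgent" | "1" =&amp;gt; Ok(1),
        "high" | "2" =&amp;gt; Ok(2),
        "medium" | "3" =&amp;gt; Ok(3),
        "low" | "4" =&amp;gt; Ok(4),
        // Layer 3: anything outside the known enum is rejected with an
        // actionable message, so the agent can self-correct on retry.
        other =&amp;gt; Err(format!(
            "unknown priority '{other}': use 1-4 or urgent/high/medium/low"
        )),
    }
}

fn main() {
    assert_eq!(normalize_priority("urgent"), Ok(1)); // absorbed, not rejected
    assert!(normalize_priority("mega").is_err());    // validated, not guessed
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The point isn't the code; it's that both the absorption and the rejection are deterministic.&lt;/p&gt;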

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;
Five layers. Each one deterministic. Each one designed to catch a specific class of errors the LLM is guaranteed to make. The LLM doesn’t need to be right—it just needs to be _approximately_ right, and the stack takes care of the rest.

It’s like a purification funnel. Dirty water (stochastic LLM input) goes in at the top, and clean water (valid GraphQL queries) comes out the bottom. Each layer filters a specific impurity. No single layer is sufficient. All of them together are.

&lt;span class="gu"&gt;## MDD vs. Fuzz Testing: The Key Difference&lt;/span&gt;

If you’re familiar with fuzz testing, you might think "this is the same thing." It’s not.

|                            | Fuzz Testing              | MDD                                   |
| -------------------------- | ------------------------- | ------------------------------------- |
| &lt;span class="gs"&gt;**Input**&lt;/span&gt;                  | Random, malformed         | Plausible, coherent, well-written     |
| &lt;span class="gs"&gt;**Goal**&lt;/span&gt;                   | Find crashes, segfaults   | Find semantic errors                   |
| &lt;span class="gs"&gt;**Does input look valid?**&lt;/span&gt; | No                        | Yes—that's the problem                |
| &lt;span class="gs"&gt;**Example**&lt;/span&gt;                | &lt;span class="sb"&gt;`\x00\xff\xfe`&lt;/span&gt; as a name  | &lt;span class="sb"&gt;`--priority urgent`&lt;/span&gt; as a flag         |

A fuzzer generates garbage and sees if your program crashes. MDD generates input that _looks_ correct but is factually wrong. &lt;span class="sb"&gt;`--priority urgent`&lt;/span&gt; isn’t garbage—it’s exactly what a human, familiar with the domain but not the API, would write. A fuzzer would never generate that because it’s too coherent.

The same applies to mutation testing and chaos engineering. They mutate your code or break your infrastructure to see if your tests catch it. MDD doesn’t break anything—it generates input that is _correct according to another worldview_. It’s the difference between a brute-force attack and a social engineering attack. One tries every combination; the other convinces you to open the door.

&lt;span class="gu"&gt;## The Actionable Takeaway&lt;/span&gt;

You don’t need to build a CLI in Rust to apply MDD. The pattern works with any tool an LLM might use:

&lt;span class="gs"&gt;**Step 1: Observe the madness.**&lt;/span&gt; Before designing (or redesigning) a tool, make the LLM use the current version and log every error. Not 5 sessions—50. Patterns emerge with volume.

&lt;span class="gs"&gt;**Step 2: Categorize errors.**&lt;/span&gt; Are they nomenclature issues? Formatting errors? Semantic misunderstandings? Each category requires a different type of defense.

&lt;span class="gs"&gt;**Step 3: Design to absorb.**&lt;/span&gt; Don’t reject &lt;span class="sb"&gt;`--status`&lt;/span&gt; with a cryptic error. Accept &lt;span class="sb"&gt;`--status`&lt;/span&gt; as an alias for &lt;span class="sb"&gt;`--state`&lt;/span&gt;. Don’t reject &lt;span class="sb"&gt;`urgent`&lt;/span&gt; as a priority. Normalize it to &lt;span class="sb"&gt;`1`&lt;/span&gt;. The user you’ll most often have is an agent that knows 80% of the domain. Design for that 80%.

&lt;span class="gs"&gt;**Step 4: Release and verify.**&lt;/span&gt; Hand the new tool to the LLM without special instructions. Every new error is a gap in Layer 1. Patch it and iterate.

If humans and LLMs are both going to use your tool, MDD defenses improve the experience for everyone. Because humans make the same mistakes as LLMs—just fewer of them and with more embarrassment.

&lt;span class="gu"&gt;## The Architect Designs the Sancho&lt;/span&gt;

There’s a common misconception I want to clear up. The LLM doesn’t design the Sancho Panza Stack. The LLM is Don Quijote. You are Cervantes.

You’re the one observing the madness patterns. You’re the one deciding what to normalize and reject. You’re the one building the deterministic layers. The LLM can help implement them—it’s great at cranking out code—but the design decisions are yours.

It’s the difference between "I asked my AI to fix its own mistakes" (doesn’t work—it will repeat them) and "I observed my AI’s mistakes and built a system to absorb them" (works—the system is deterministic).

No way should you trust the LLM to self-correct. Its stochastic nature makes it certain to repeat the same errors with creative variations. What you need isn’t a better LLM—it’s a better Sancho.

&lt;span class="gu"&gt;## What Really Matters&lt;/span&gt;

MDD isn’t a testing methodology. It’s a _tool design methodology_. The question isn’t "How do I detect when the LLM is wrong?" but "How do I design so that being wrong has no consequences?"

It’s the same philosophy as guardrails on a mountain road. You don’t prevent bad turns—you put up a barrier so bad turns don’t kill you. You don’t fix the driver—you make the road safer.

Cervantes understood this four centuries ago. He didn’t try to cure Don Quijote. He gave him Sancho Panza and let the story work.

Your CLI, your API, your SDK—whatever your LLM is going to touch—needs its own Sancho. Deterministic, stubborn, incapable of hallucination. Not brilliant. Not creative. Just correct.

Design for madness. The sane adapt to the mad.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>llm</category>
      <category>rust</category>
      <category>cli</category>
    </item>
    <item>
      <title>My AI Read a JSON File from Disk 900 Times in a Loop (And Why No Linter Can Save You)</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:09:41 +0000</pubDate>
      <link>https://dev.to/frr149/my-ai-read-a-json-file-from-disk-900-times-in-a-loop-and-why-no-linter-can-save-you-21eg</link>
      <guid>https://dev.to/frr149/my-ai-read-a-json-file-from-disk-900-times-in-a-loop-and-why-no-linter-can-save-you-21eg</guid>
      <description>&lt;p&gt;Last week my AI wrote code that read a JSON file from disk, parsed it, did &lt;strong&gt;one&lt;/strong&gt; lookup, and repeated this 900 times inside a &lt;code&gt;for&lt;/code&gt; loop. Each iteration: open file, decode JSON, look up a value, throw it all away. Start over.&lt;/p&gt;

&lt;p&gt;It's a mistake I teach my students not to make within their first month of programming.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened (straight to the point)
&lt;/h2&gt;

&lt;p&gt;I'm building Tokamak, a macOS menu bar app that monitors Claude Max quota. Part of the functionality scans ~900 JSONL files from Claude Code sessions. For each file, it needs to know the &lt;em&gt;byte offset&lt;/em&gt; where it left off last time (incremental reading — only process what's new).&lt;/p&gt;

&lt;p&gt;The offsets are stored in a JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"offsets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"project-a/session-1.jsonl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;48231&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"project-b/session-2.jsonl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12044&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;Dictionary&amp;lt;String, UInt64&amp;gt;&lt;/code&gt;. 900 entries. ~55KB. Nothing fancy.&lt;/p&gt;

&lt;p&gt;And here's the detail that makes it even more absurd: &lt;strong&gt;the app itself created this file&lt;/strong&gt;. It's not JSON from an external API. It doesn't come from Claude Code. It's an internal state file that Tokamak writes and reads to track where it left off reading each session. The AI was reading from disk, 900 times, a file that the app itself had generated.&lt;/p&gt;

&lt;p&gt;"Why not use Core Data or SQLite, since you already have them in the app?" Good question. Because this file is a &lt;strong&gt;disposable progress cache&lt;/strong&gt;. If it gets corrupted, you delete it and the next scan rebuilds all offsets by reading the entire files once. Zero data loss. Plus: I can &lt;code&gt;cat session-offsets.json | jq .&lt;/code&gt; to debug (with Core Data I need &lt;code&gt;sqlite3&lt;/code&gt; and the sandbox path), it's &lt;code&gt;Sendable&lt;/code&gt; without the background context dance, and if Core Data's SQLite gets corrupted it doesn't take down the offsets (or vice versa). For 55KB of a flat dictionary, the ceremony of an entity with schema migration isn't justified.&lt;/p&gt;

&lt;p&gt;The format wasn't the problem. The access was.&lt;/p&gt;

&lt;p&gt;The code the AI wrote for the scan loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  &lt;span class="c1"&gt;// 900 files&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;storedOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// ↑ THIS reads and parses the JSON from disk. Every. Time.&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fileSize&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;storedOffset&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// ... read file, update offset ...&lt;/span&gt;
    &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setOffset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newOffset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// ↑ And THIS reads it AGAIN, modifies, and saves it.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two disk calls per iteration. 900 iterations. &lt;strong&gt;1,800 I/O operations&lt;/strong&gt; where there should have been exactly &lt;strong&gt;2&lt;/strong&gt;: one read at the start, one write at the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers (xctrace doesn't lie)
&lt;/h2&gt;

&lt;p&gt;I caught it with &lt;em&gt;Instruments&lt;/em&gt; (Time Profiler). The data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total samples&lt;/td&gt;
&lt;td&gt;7,260&lt;/td&gt;
&lt;td&gt;489&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Samples in &lt;code&gt;OffsetStore.load()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;1,704 (88%)&lt;/td&gt;
&lt;td&gt;10 (2%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scan time&lt;/td&gt;
&lt;td&gt;&amp;gt;20s&lt;/td&gt;
&lt;td&gt;&amp;lt;0.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;td&gt;~1.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;88% of scan time was reading and parsing a 900-line JSON. Over and over. Like Sisyphus pushing his boulder, but with &lt;code&gt;JSONDecoder&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix (that should make you cringe)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BEFORE: I/O on every iteration&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// reads JSON&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setOffset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newOffset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// reads + writes JSON&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// AFTER: load once, operate in memory, save once&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;offsets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;// ONCE&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;// O(1) in memory&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newOffset&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// ONCE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data structure didn't change. It was still a &lt;code&gt;Dictionary&amp;lt;String, UInt64&amp;gt;&lt;/code&gt;. The &lt;em&gt;hash table&lt;/em&gt; was already optimal. What was suboptimal was &lt;strong&gt;rebuilding it from disk on every iteration&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What doesn't work: adding "don't do this" to your CLAUDE.md
&lt;/h2&gt;

&lt;p&gt;After the fix, I added this to the project's &lt;code&gt;CLAUDE.md&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"NEVER do I/O (disk, network, decode JSON, Core Data fetch) inside a loop if it can be done before. Load data once before the loop, operate in memory, save once after."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And here's what I really want to tell you: &lt;strong&gt;it didn't help at all&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Weeks later, when adding a second service (Codex), the AI generated exactly the same pattern. With the instruction right there. It's like putting up a "keep off the grass" sign and expecting it to work.&lt;/p&gt;

&lt;p&gt;Why? Because the LLM doesn't &lt;em&gt;understand&lt;/em&gt; the rule. It has &lt;em&gt;seen&lt;/em&gt; it. Statistically, most code it read during training does one-off I/O, not I/O inside 900-iteration loops. The &lt;code&gt;load → use → save&lt;/code&gt; pattern inside a function is the most likely completion. That the function gets called from a 900-iteration &lt;code&gt;for&lt;/code&gt; loop is a contextual detail the model has no incentive to track.&lt;/p&gt;

&lt;h2&gt;
  
  
  What also doesn't work: linters
&lt;/h2&gt;

&lt;p&gt;No linter can catch this. Not SwiftLint, not ESLint, not Ruff, not Clippy. Think about it: the code is &lt;strong&gt;syntactically correct and semantically valid&lt;/strong&gt;. Each individual call to &lt;code&gt;offsetStore.offset(for:)&lt;/code&gt; is perfectly reasonable. The problem isn't in any single line — it's in the &lt;strong&gt;composition&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Looking at the layers of code meaning (an idea I use in my adversarial development course):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Fails here?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Signal&lt;/td&gt;
&lt;td&gt;Is this code?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Language&lt;/td&gt;
&lt;td&gt;Is it valid Swift?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Syntax&lt;/td&gt;
&lt;td&gt;Does it compile?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Local semantics&lt;/td&gt;
&lt;td&gt;Does the function do what it promises?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. System semantics&lt;/td&gt;
&lt;td&gt;Does it respect contracts and performance?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Architecture&lt;/td&gt;
&lt;td&gt;Does it scale without degrading?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The failure is in layers 5-6. Exactly where LLMs fail today in 2026. The syntax and local logic are impeccable. The problem is &lt;em&gt;emergent&lt;/em&gt;: it appears when a correct function gets used in a context that turns it into a bottleneck.&lt;/p&gt;

&lt;p&gt;A linter operates in layers 2-4. &lt;strong&gt;It has no visibility into composition or performance.&lt;/strong&gt; It's like asking Word's spell checker to detect a logical fallacy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The only thing that works: performance tests after the fact
&lt;/h2&gt;

&lt;p&gt;After the first fix, I wrote this test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@Test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Scan performance does not degrade with file count"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;scanPerformanceDoesNotDegradeWithFileCount&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;throws&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Create 1000 JSONL files with minimal content&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"..."&lt;/span&gt; &lt;span class="c1"&gt;// one valid line&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendingPathComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"session-&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;.jsonl"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// Pre-populate offset store (simulate re-scan)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;offsets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SessionOffsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;OffsetData&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"session-&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;.jsonl"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;offsetStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;ContinuousClock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;ContinuousClock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;

    &lt;span class="cp"&gt;#expect(elapsed &amp;lt; .seconds(3))  // &amp;lt;3s for 1000 files&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a brutally simple regression test. 1000 files, under 3 seconds, or the test fails. If anyone (human or AI) puts I/O back inside the loop, the test goes from taking 0.2 seconds to taking 30, and explodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And this is exactly what happened.&lt;/strong&gt; When the AI generated the second service with the same bug, the first service's performance test kept passing (it was a different service). But when I wrote the equivalent test for the new service, it failed immediately. The test did its job: catch the regression that neither the &lt;code&gt;CLAUDE.md&lt;/code&gt; nor any linter could see.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this confirms
&lt;/h2&gt;

&lt;p&gt;This bug is the perfect demonstration of the central thesis of what I call &lt;strong&gt;adversarial development&lt;/strong&gt;: &lt;em&gt;never trust, always verify&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;You can't trust that AI won't make freshman-level mistakes. It will. Repeatedly. Even when you tell it not to.&lt;/p&gt;

&lt;p&gt;You can't trust that linters will catch it. They can't. The error is above their abstraction level.&lt;/p&gt;

&lt;p&gt;What you can do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Performance tests&lt;/strong&gt; as an after-the-fact safety net&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real profiling&lt;/strong&gt; (xctrace, Instruments) to measure, not guess&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Defense in depth&lt;/strong&gt;: multiple layers, because no single layer covers everything&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The defense isn't a wall. It's an onion. Layers upon layers. And when one fails, the next one catches it.&lt;/p&gt;

&lt;h2&gt;
  
  
  For the skeptics
&lt;/h2&gt;

&lt;p&gt;"But Fernando, wouldn't a human programmer make the same mistake?"&lt;/p&gt;

&lt;p&gt;A junior, yes. A senior, probably not — because they have the pattern internalized. But even a senior would do &lt;em&gt;code review&lt;/em&gt; and catch it. The problem with AI-generated code is &lt;strong&gt;volume&lt;/strong&gt;: 50 files in 10 minutes. Nobody reviews 50 files line by line. Discriminator fatigue is real.&lt;/p&gt;

&lt;p&gt;And that's why you need verification to be automatic, not human. The performance test doesn't get tired. It doesn't get distracted. It has no fatigue. It runs every time you do &lt;code&gt;make test&lt;/code&gt; and tells you if something smells wrong.&lt;/p&gt;

&lt;p&gt;It's the same principle I apply in &lt;a href="https://dev.to/es/cinco-defensas-alucinaciones-codigo/"&gt;the 5 defenses against hallucinations&lt;/a&gt;: the verification system must be external to the generator. If the AI writes the code, verification has to come from somewhere else. In this case, from a clock that measures how long it takes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>swift</category>
    </item>
    <item>
      <title>Linear Agent Isn’t What You Need. Your Agent Was Already in the Terminal</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:06:38 +0000</pubDate>
      <link>https://dev.to/frr149/linear-agent-isnt-what-you-need-your-agent-was-already-in-the-terminal-45pk</link>
      <guid>https://dev.to/frr149/linear-agent-isnt-what-you-need-your-agent-was-already-in-the-terminal-45pk</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Linear just launched an integrated AI agent. Cool, but it doesn’t address the problem developers face when working with &lt;em&gt;coding agents&lt;/em&gt; in the terminal. What we actually need isn’t another AI agent but a rock-solid CLI that our existing agents can use seamlessly. And if we’re going to build one, it should be in Rust — which is why &lt;a href="https://github.com/frr149/lql" rel="noopener noreferrer"&gt;lql&lt;/a&gt; exists: a CLI for Linear, purpose-built for agents.&lt;/p&gt;




&lt;p&gt;Yesterday, Linear &lt;a href="https://linear.app/changelog/2026-03-24-introducing-linear-agent" rel="noopener noreferrer"&gt;launched their AI agent&lt;/a&gt;. It’s an integrated chatbot that gets your &lt;em&gt;roadmap&lt;/em&gt;, your issues, and even your code. You can chat with it on Slack, mention it in a comment, and it’ll synthesize context, suggest actions, and even create issues for you.&lt;/p&gt;

&lt;p&gt;Sounds awesome. Seriously, it sounds great.&lt;/p&gt;

&lt;p&gt;And yet, when I read the announcement, the first thought that crossed my mind was: “This is not what I needed.”&lt;/p&gt;

&lt;h2&gt;
  
  
  The Linear Saga
&lt;/h2&gt;

&lt;p&gt;To understand why I’m saying that, some context might help. My relationship with Linear has been a love-hate story worthy of a daytime soap opera.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act I: The MCP.&lt;/strong&gt; Linear had this MCP server for AI agents to interact with. It worked like a lighter in a hurricane: technically it could light up, but the flame wouldn’t last more than two seconds. It was janky, slow, and had a special talent for failing right when you needed it the most. I uninstalled it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act II: The GraphQL API.&lt;/strong&gt; The alternative was interacting directly with Linear via GraphQL. And, yes, it worked. Until the moment you had to input special characters in an issue description, and dealing with escaping made you question your life choices. There was this one time I spent more time figuring out how to escape a parenthesis than writing the actual code the issue described.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act III: The Linear CLI.&lt;/strong&gt; Enter &lt;a href="https://github.com/schpet/linear-cli" rel="noopener noreferrer"&gt;&lt;code&gt;linear&lt;/code&gt; CLI&lt;/a&gt;, a community-driven project. &lt;code&gt;brew install schpet/tap/linear&lt;/code&gt; and off you go. It was a humble tool, no frills, but it did exactly what I needed: create, list, and update issues from the terminal without wrestling GraphQL or ghost MCPs. No pop-ups, no surprises.&lt;/p&gt;

&lt;p&gt;&lt;a href="//{{&amp;lt;%20relref%20"&gt;}}"&amp;gt;In a previous post,&lt;/a&gt; I wrote about retiring other tools in favor of this CLI. I managed to create 49 issues in under one minute with a bash script. With MCP, it would’ve taken me an hour and a half.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter the Agent
&lt;/h2&gt;

&lt;p&gt;And now Linear rolls out their new AI agent. The pitch: an integrated assistant that understands your workspace, connects with your codebase, and automates workflows.&lt;/p&gt;

&lt;p&gt;Check this out: &lt;strong&gt;you know what the agent &lt;em&gt;doesn’t&lt;/em&gt; do?&lt;/strong&gt; Work via the terminal. It’s not a tool for &lt;em&gt;your&lt;/em&gt; AI agent. It’s a Linear AI agent that lives entirely within Linear.&lt;/p&gt;

&lt;p&gt;If you’re working with Claude Code, Codex, or any &lt;em&gt;coding agent&lt;/em&gt; in the terminal, Linear’s agent isn’t helpful to you at all. Your agent can’t invoke Linear’s agent to create an issue. It’s not composable. It’s not a Lego piece that plugs into your workflow. It’s a closed product within a closed product.&lt;/p&gt;

&lt;p&gt;That is to say: Linear built an agent for product managers working inside the Linear app — not for developers working in the terminal with AI agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Already Had Your Agent
&lt;/h2&gt;

&lt;p&gt;Here’s the epiphany I had while reading that announcement: I already &lt;em&gt;have&lt;/em&gt; an agent for Linear. It’s called Claude Code.&lt;/p&gt;

&lt;p&gt;I don’t need Linear to put a chatbot inside their app for me. What I need is for Linear’s &lt;em&gt;programmable interface&lt;/em&gt; to not be a hack job. To simply ensure that when I tell my agent, “Create an issue with these details,” it just works. Every time, hassle-free.&lt;/p&gt;

&lt;p&gt;And that’s precisely what a good CLI is supposed to do. My agent — Claude Code — already knows how to use the terminal. It already knows how to execute commands. It already knows how to parse &lt;em&gt;output&lt;/em&gt;. All it needs is a reliable tool on the other side.&lt;/p&gt;

&lt;p&gt;I tell Claude Code, “Create an issue in Linear with high priority,” and it executes a terminal command. It works. Next task. No chatbot, no fancy GUI, no Slack. One command, one result.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future is CLI (Surprisingly)
&lt;/h2&gt;

&lt;p&gt;Here’s a hot take: in a world where everyone is building AI agents with conversational interfaces inside their apps, the future for developers is, paradoxically, the &lt;em&gt;command-line interface&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Why? Because the CLI is the universal interface for agents. Your &lt;em&gt;coding agent&lt;/em&gt; can’t click buttons. It can’t navigate a web app. It can’t use a chatbot embedded in another app. But it &lt;em&gt;can&lt;/em&gt; execute a command and read its &lt;em&gt;output&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The CLI is the most democratic API out there. No SDKs, no 15 OAuth redirects, no MCP that breaks every other Tuesday. One binary, a few flags, &lt;em&gt;stdin&lt;/em&gt;/&lt;em&gt;stdout&lt;/em&gt;. Unix nailed it 50 years ago because it works.&lt;/p&gt;

&lt;p&gt;The real problem is that most SaaS tool CLIs are an afterthought. “Oh, you also need a CLI? Fine, let an intern slap a wrapper on our REST API.” And the result? Tools that spew unreadable JSON, lack autocomplete, fail silently, or require a token that expires every 37 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  500+ Errors No One Noticed
&lt;/h2&gt;

&lt;p&gt;But before talking about rewriting anything, I wanted data. Not gut feelings — actual data. So I did something only someone with an LLM and 1 million context tokens would think to do: I asked Claude Code to parse its own past sessions and identify &lt;em&gt;every time&lt;/em&gt; it failed while interacting with Linear.&lt;/p&gt;

&lt;p&gt;165 sessions. 11 projects. Months of history. And the results were... eye-opening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;500+ errors. 370+ retries.&lt;/strong&gt; A conservative estimate of 700,000 tokens wasted per month just battling Linear.&lt;/p&gt;

&lt;p&gt;The errors break down into categories that are downright cringeworthy when viewed together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The classic: &lt;code&gt;--sort&lt;/code&gt; forgotten.&lt;/strong&gt; Linear CLI requires &lt;code&gt;--sort priority&lt;/code&gt; on every &lt;code&gt;list&lt;/code&gt;. No default. Omitted it? Error. Claude forgot it &lt;strong&gt;40 times&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The translator: UI vs CLI states.&lt;/strong&gt; In Linear’s UI, states are labeled as "Todo," "In Progress," and "Done." But in the CLI, they’re &lt;code&gt;unstarted&lt;/code&gt;, &lt;code&gt;started&lt;/code&gt;, &lt;code&gt;completed&lt;/code&gt;. Claude used the UI names 12 times. &lt;code&gt;--state "Todo"&lt;/code&gt; → error. &lt;code&gt;--state "In Progress"&lt;/code&gt; → error. Same mistakes, over and over.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The optimist: flags that don’t exist.&lt;/strong&gt; &lt;code&gt;--status&lt;/code&gt; instead of &lt;code&gt;--state&lt;/code&gt; (11 times). &lt;code&gt;--priority urgent&lt;/code&gt; instead of &lt;code&gt;--priority 1&lt;/code&gt; (17 times). &lt;code&gt;--no-pager&lt;/code&gt; on unsupported commands (15 times). And the list goes on — all errors due to guesswork.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the cherry on top? &lt;strong&gt;171 calls to Linear’s MCP — which had already been uninstalled.&lt;/strong&gt; Across four projects. Even after I typed out: “Linear’s MCP is trash, use the API.”&lt;/p&gt;

&lt;h2&gt;
  
  
  How lql Addresses All of This
&lt;/h2&gt;

&lt;p&gt;It’s one thing to complain; it’s another to fix it. Each of the problems above has a concrete solution in lql:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;th&gt;Solution in lql&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--sort&lt;/code&gt; forgotten&lt;/td&gt;
&lt;td&gt;40+&lt;/td&gt;
&lt;td&gt;Default &lt;code&gt;priority&lt;/code&gt;. No arguments needed for &lt;code&gt;lql list&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI vs CLI state mismatch&lt;/td&gt;
&lt;td&gt;12+&lt;/td&gt;
&lt;td&gt;Automatic aliasing. &lt;code&gt;Todo&lt;/code&gt; → &lt;code&gt;unstarted&lt;/code&gt;, &lt;code&gt;Done&lt;/code&gt; → &lt;code&gt;completed&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--priority urgent&lt;/code&gt; mistake&lt;/td&gt;
&lt;td&gt;17+&lt;/td&gt;
&lt;td&gt;Automatic aliasing. &lt;code&gt;urgent&lt;/code&gt; → &lt;code&gt;1&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt; → &lt;code&gt;2&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--no-interactive&lt;/code&gt; absent&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;No interactive mode. Commands never hang.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broken JSON escaping&lt;/td&gt;
&lt;td&gt;25+ (80+ retries)&lt;/td&gt;
&lt;td&gt;Native GraphQL variables. No broken strings, only properly built JSON.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Everything boils down to reducing friction. A good CLI makes the correct usage the easy usage. Nothing more, nothing less.&lt;/p&gt;
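&lt;p&gt;As a sketch of what that absorption looks like at the parser level, here's illustrative clap code (not lql's actual source) covering three rows of the table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use clap::{Arg, ArgAction, Command};

// Illustrative sketch, not lql's real code: three of the audited error
// classes, absorbed at the parser level instead of rejected.
fn cli() -&amp;gt; Command {
    Command::new("lql").subcommand(
        Command::new("list")
            // --status (11 misuses) becomes a silent alias of --state.
            .arg(Arg::new("state").long("state").alias("status"))
            // A forgotten --sort (40+ misuses) falls back to a default.
            .arg(Arg::new("sort").long("sort").default_value("priority"))
            // --no-pager (15 misuses) is accepted and ignored: no pager exists.
            .arg(
                Arg::new("no-pager")
                    .long("no-pager")
                    .action(ArgAction::SetTrue)
                    .hide(true),
            ),
    )
}

fn main() {
    // The "wrong" flag parses cleanly instead of erroring out.
    let m = cli().get_matches_from(["lql", "list", "--status", "started"]);
    let list = m.subcommand_matches("list").unwrap();
    assert_eq!(list.get_one::&amp;lt;String&amp;gt;("state").map(String::as_str), Some("started"));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;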

&lt;h2&gt;
  
  
  If You’re Rewriting, Use Rust
&lt;/h2&gt;

&lt;p&gt;Here’s where the unexpected twist comes in (or expected, if you know me).&lt;/p&gt;

&lt;p&gt;If a CLI is the critical bridge between your agent and your &lt;em&gt;issue tracker&lt;/em&gt;, then it should be written with care. In a language that prevents you from shipping garbage. With proper error handling. With a static binary that doesn’t rely on Node or Python runtime environments.&lt;/p&gt;

&lt;p&gt;Let’s beat the dead horse here: if we’re rewriting anything, it should be in Rust.&lt;/p&gt;

&lt;p&gt;And the project name? &lt;strong&gt;lql&lt;/strong&gt; — &lt;em&gt;Linear Query Language&lt;/em&gt;. Like SQL, but for your &lt;em&gt;issue tracker&lt;/em&gt;. SQL is the language for querying databases; lql is the language for querying your backlog.&lt;/p&gt;

&lt;p&gt;Oh, and one last juicy note: Linear's official CLI? It’s &lt;strong&gt;157 MB&lt;/strong&gt; (bundled Node.js runtime). &lt;code&gt;lql&lt;/code&gt;? Just &lt;strong&gt;4.7 MB&lt;/strong&gt;. A static binary, 33 times smaller, with no JavaScript baggage.&lt;/p&gt;

&lt;p&gt;Ferris the crab approves. 🦀&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Series: Adversarial Programming&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next: &lt;a href="https://dev.to/en/adversarial-programming-ai-copilot-invents-api/"&gt;Adversarial programming: when your AI copilot invents APIs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Previous: &lt;a href="https://dev.to/en/wrong-path-impossible-not-forbidden/"&gt;The wrong path shouldn’t be forbidden, it should be impossible&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>linear</category>
      <category>cli</category>
      <category>rust</category>
      <category>ai</category>
    </item>
    <item>
      <title>Five Nonexistent Experts Review Your Startup Before You Build It</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:04:36 +0000</pubDate>
      <link>https://dev.to/frr149/five-nonexistent-experts-review-your-startup-before-you-build-it-4cem</link>
      <guid>https://dev.to/frr149/five-nonexistent-experts-review-your-startup-before-you-build-it-4cem</guid>
      <description>&lt;p&gt;In November 2024, a project named &lt;strong&gt;Freysa&lt;/strong&gt; assigned an LLM agent to guard an Ethereum wallet. The instruction was straightforward: under no circumstance should the funds be transferred. Participants paid increasing amounts for each attempt to convince it otherwise. After 481 attempts and $47,000 added to the pot, someone managed to trick the model into believing that the &lt;em&gt;reject&lt;/em&gt; function was actually the &lt;em&gt;transfer&lt;/em&gt; function.&lt;/p&gt;

&lt;p&gt;Weeks later, Jane Street published a puzzle involving a 2,500-layer neural network that turned out to be an MD5 implementation. The winner solved it by combining matrix visualization, reduction to SAT, cryptographic pattern recognition, and a query to ChatGPT.&lt;/p&gt;

&lt;p&gt;Both projects generated more buzz than most startups with million-dollar funding rounds. The obvious question is: how do you evaluate an idea like this &lt;em&gt;before&lt;/em&gt; you build it? How do you know if it has real viral potential or if it’s just an interesting technical exercise no one will share?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Evaluating MVPs in the Viral Era
&lt;/h2&gt;

&lt;p&gt;Most frameworks for evaluating product ideas assume a rational market. Business Model Canvas, Lean Canvas, Jobs To Be Done — these are all great tools for products with predictable demand. But they fail for projects where viral distribution &lt;em&gt;is&lt;/em&gt; the product.&lt;/p&gt;

&lt;p&gt;Freysa didn’t have "customers" in the traditional sense. It didn’t solve a "job to be done." Its mechanism relied on the act of participation itself generating attention, which attracted more participants. It was a circular economy: more attempts created a bigger pot, a bigger pot attracted media coverage, and media coverage brought in more attempts.&lt;/p&gt;

&lt;p&gt;To evaluate such projects, you need perspectives that generate &lt;strong&gt;tension&lt;/strong&gt;, not consensus. A business analyst will tell you there’s no sustainable revenue model. A viral expert will say sustainability doesn’t matter if the k-factor is greater than 1. Both are right. And the truth lies somewhere in the conflict, emerging only through that friction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea: An Adversarial Council of Simulated Experts
&lt;/h2&gt;

&lt;p&gt;I’ve designed a tool that simulates a council of five experts, each equipped with a specific decision-making framework and a defined jurisdiction. These aren’t just generic personalities with famous names. Each applies a set of precise decision filters that catch signals a generic analysis would miss.&lt;/p&gt;

&lt;p&gt;The process has three phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Independent Analysis:&lt;/strong&gt; Each expert evaluates the idea through their lens, without seeing the others' input. This prevents anchoring — if the business expert speaks first and says, "This is amazing," the legal expert might soften their objections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial Debate:&lt;/strong&gt; The experts review each other’s analyses and critique them. No politeness, just arguments based on merit. A maximum of 10 rounds are allowed to reach either consensus or deadlock.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesis:&lt;/strong&gt; The final output is an actionable plan with flagged issues by area, a timeline, and — most importantly — &lt;strong&gt;kill criteria&lt;/strong&gt;: specific metrics that, if unmet, mean the project should be abandoned.&lt;/li&gt;
&lt;/ol&gt;
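
&lt;p&gt;Phase 1 maps directly onto parallel, non-interactive runs. Here’s a minimal sketch using Claude Code’s &lt;code&gt;-p&lt;/code&gt; (print) mode; the prompt files and layout are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Phase 1: five independent analyses; none sees the others' output.
# prompts/&lt;expert&gt;.md holds each persona and its decision framework.
for expert in graham lessig godin balaji dhh; do
  cat "prompts/$expert.md" idea.md \
    | claude -p "Analyze this idea strictly through your framework" \
      &gt; "analysis-$expert.md" &amp;
done
wait  # all five must finish before the debate phase reads their files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;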

&lt;h2&gt;
  
  
  The Five Selected (and Why They Were Chosen)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Paul Graham — Business and Strategy
&lt;/h3&gt;

&lt;p&gt;His framework for evaluating zero-stage startups is the most rigorous for projects with no data. His question, "Are you doing something people want?" is brutal but necessary. "People" here isn’t a market — it’s a person with a name.&lt;/p&gt;

&lt;p&gt;What he brings to the council: discipline in distinguishing between "interesting idea" and "viable business." His mantra of "do things that don’t scale" is crucial for viral MVPs, where the temptation is to build infrastructure for a million users that don’t yet exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who Didn’t Make the Cut:&lt;/strong&gt; Peter Thiel (too contrarian — sometimes he dismisses good projects for not being sufficiently "zero to one"), Alex Hormozi (focused on service businesses, not tech products focused on virality).&lt;/p&gt;

&lt;h3&gt;
  
  
  Lawrence Lessig — Legal and Regulatory
&lt;/h3&gt;

&lt;p&gt;He’s not a lawyer who just says, "This isn’t possible." Instead, he views regulation as &lt;strong&gt;architecture&lt;/strong&gt;. His "four modalities of regulation" framework (law, social norms, market, and code/architecture) helps analyze how to design systems where regulation won’t be a bottleneck, instead of trying to dodge it.&lt;/p&gt;

&lt;p&gt;What he brings to the council: the question, "What happens when the regulator notices you?" Many crypto/AI projects are legally irrelevant at small scale but become regulated when large. Lessig identifies the threshold where regulation gets triggered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who Didn’t Make the Cut:&lt;/strong&gt; A generic corporate lawyer (they’d kill any project early with a barrage of "no's"). Lessig goes beyond the law, recognizing that system design can make legal intervention unnecessary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Seth Godin — Marketing and Positioning
&lt;/h3&gt;

&lt;p&gt;His core question — "Who is your &lt;em&gt;smallest viable audience&lt;/em&gt; and why do they care?" — is perhaps the most critical for a viral launch. He doesn’t think about "reaching millions"; he focuses on "reaching the first 100 people who truly care."&lt;/p&gt;

&lt;p&gt;What he brings to the council: the remarkability test. Is this something that someone will share without you asking? "Useful" doesn’t get shared. "Remarkable" does. His concept of "Tribes" perfectly aligns with tech/crypto communities that already have strong group identities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who Didn’t Make the Cut:&lt;/strong&gt; Philip Kotler (too corporate — thinks in terms of traditional multinational marketing), April Dunford (her positioning framework is incredible but geared towards repositioning existing products, not launching new ones).&lt;/p&gt;

&lt;h3&gt;
  
  
  Balaji Srinivasan — Hype and Virality
&lt;/h3&gt;

&lt;p&gt;The most aggressive adviser on the panel, Balaji has a native understanding of crypto-style distribution mechanics: FOMO, tokenized incentives, network effects, and how something goes from zero to trending within 48 hours.&lt;/p&gt;

&lt;p&gt;What he brings to the council: the question, "What makes someone screenshot this and post it on Twitter in the next five minutes?" This is the atomic unit of virality. If your product doesn’t inspire spontaneous screenshots, you’ll need a marketing budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who Didn’t Make the Cut:&lt;/strong&gt; GaryVee (understands attention but not the crypto+AI intersection where viral mechanisms thrive today), Mr. Beast (his expertise is video content virality, not tech products), Nir Eyal (his "Hooked" framework targets retention, not launch virality — separate problems).&lt;/p&gt;

&lt;h3&gt;
  
  
  DHH (David Heinemeier Hansson) — Technical
&lt;/h3&gt;

&lt;p&gt;His obsession is "the simplest thing that works." For an MVP, the greatest technical risk isn’t picking the wrong stack — it’s never launching because you spent three months choosing one.&lt;/p&gt;

&lt;p&gt;What he brings to the council: the question, "Can one person build this in two weeks?" If not, the scope is too large, or the stack is overly complicated. His rule of "boring technology" (PostgreSQL, not CockroachDB; Redis, not Dragonfly) counters "we’re using blockchain because we can" syndrome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who Didn’t Make the Cut:&lt;/strong&gt; Werner Vogels (focuses on scalability from day one, which isn’t needed for MVPs), Kelsey Hightower (deep Kubernetes expertise, which usually results in over-engineering an MVP — using a sledgehammer to crack a nut).&lt;/p&gt;

&lt;h2&gt;
  
  
  Productive Tensions: Where Truth Emerges
&lt;/h2&gt;

&lt;p&gt;The tensions between council members aren’t a flaw in the design. They &lt;em&gt;are&lt;/em&gt; the design.&lt;/p&gt;

&lt;h3&gt;
  
  
  Balaji vs. Lessig: Virality vs. Regulation
&lt;/h3&gt;

&lt;p&gt;This is the primary tension. Balaji will push for FOMO mechanics involving real money (visible prize pools, pay-to-play, tokens). Lessig will point out that in the EU, pay-to-play with accumulating prize pools qualifies as gambling and requires a gaming license.&lt;/p&gt;

&lt;p&gt;The productive resolution isn’t one side "winning." It’s a design that satisfies both — for example, free challenges with sponsored prize pools (legal in most jurisdictions) instead of direct entry fees (regulated as gambling in many countries).&lt;/p&gt;

&lt;h3&gt;
  
  
  Godin vs. DHH: Remarkable vs. Spartan
&lt;/h3&gt;

&lt;p&gt;Godin will want a memorable experience — a public leaderboard with animations, participant profiles, achievement badges. DHH will advocate for a static page with SQLite and a form.&lt;/p&gt;

&lt;p&gt;The resolution: Can you achieve remarkability with boring tech? The answer is almost always yes. The challenge itself is the remarkable element, not the interface. A leaderboard in an HTML table with no JavaScript can be more notable than a Three.js dashboard if the content displayed is genuinely impressive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Paul Graham vs. Balaji: Unit Economics vs. Growth
&lt;/h3&gt;

&lt;p&gt;PG will ask for a clear revenue model from day one. Balaji will argue that viral distribution &lt;em&gt;is&lt;/em&gt; the model — audience first, monetization later.&lt;/p&gt;

&lt;p&gt;Both have precedents to back them up. Instagram had no revenue model when it reached 100 million users. But for every Instagram, there are 10,000 projects that scaled without revenue and ultimately failed.&lt;/p&gt;

&lt;p&gt;The usual resolution is temporal: validate virality first (giving Balaji the win), but impose a strict timeline for demonstrating unit economics (giving PG the eventual win). The kill criteria formalize this agreement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Most Valuable Output: Kill Criteria
&lt;/h2&gt;

&lt;p&gt;Most side projects die slowly. There’s no clear moment when they fail. The founder just stops dedicating time because "other things came up." Three months later, the domain expires, and no one notices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kill criteria&lt;/strong&gt; are the opposite: concrete thresholds, with defined deadlines, that signal when to stop.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Deadline&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Beta participants&lt;/td&gt;
&lt;td&gt;&amp;lt;50 in 2 weeks&lt;/td&gt;
&lt;td&gt;Week 2&lt;/td&gt;
&lt;td&gt;Pivot or stop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Launch shares&lt;/td&gt;
&lt;td&gt;&amp;lt;100&lt;/td&gt;
&lt;td&gt;Week 4&lt;/td&gt;
&lt;td&gt;Reevaluate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retention rate&lt;/td&gt;
&lt;td&gt;&amp;lt;10% 30-day retention&lt;/td&gt;
&lt;td&gt;Week 8&lt;/td&gt;
&lt;td&gt;Stop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rule: if two of the three kill criteria fire, the project halts. No exceptions. No "one more month." No "we didn’t do enough marketing."&lt;/p&gt;

&lt;p&gt;This is what separates a professional from an amateur. Amateurs fall in love with the idea. Professionals fall in love with the outcome. And if the outcome doesn’t materialize within the agreed timeframe, they have the discipline to move on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Simulations, Not Real People?
&lt;/h2&gt;

&lt;p&gt;The obvious objection: Why not talk to real people instead of simulating experts with an LLM?&lt;/p&gt;

&lt;p&gt;Three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Availability.&lt;/strong&gt; Paul Graham isn’t giving you two hours to analyze your side project. The simulation will. And while the simulation doesn’t have the original’s accumulated experience, it applies their published frameworks with a consistency busy people might not achieve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honest Adversariality.&lt;/strong&gt; Real people soften their critiques out of politeness. A simulation configured to be adversarial will actually question everything. "You don’t have a functional revenue model" is something that an investor might think but not say out loud in a first meeting. The simulation says it in round one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero Marginal Cost.&lt;/strong&gt; You can run the council five times, tweaking variations of the same idea, and compare results. Trying to do that with real people would consume 25 hours of their time.&lt;/p&gt;

&lt;p&gt;Simulations don’t replace real advisors. But they prepare you for those conversations by eliminating obvious issues beforehand. It’s the difference between presenting a clean draft and showing up with an unfiltered first pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Pattern: Structured Debate as a Decision-Making Tool
&lt;/h2&gt;

&lt;p&gt;This design isn’t just for MVPs. I already use it for code reviews (three experts in simplicity, design, and performance) and design reviews (four experts in information density, usability, product, and interaction).&lt;/p&gt;

&lt;p&gt;The core pattern remains:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Experts with defined jurisdictions:&lt;/strong&gt; Each has domain-specific authority. Outside their domain, they have no vote.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit decision frameworks:&lt;/strong&gt; It’s not "what do you think," but "what does your framework say about this."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planned Tensions:&lt;/strong&gt; Conflicts between experts are intentional. They’re the most valuable source of insight in the process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forced Convergence:&lt;/strong&gt; Maximum of N rounds. If no consensus is reached, the moderator decides and documents dissent as a risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actionable Output:&lt;/strong&gt; Not an essay but specific issues, deadlines, and success/failure criteria.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The difference between "asking one LLM to analyze your idea" and "having five specialized LLMs debate your idea" is not one of degree. It’s one of kind. The former produces an opinion. The latter produces a risk map and plan, exposing blind spots as the perspectives clash.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question You Should Be Asking Yourself
&lt;/h2&gt;

&lt;p&gt;Before you write the first line of code for your next project, ask yourself: Who’s going to tell you it’s a bad idea?&lt;/p&gt;

&lt;p&gt;If the answer is "no one, because I haven’t asked anyone," you already have a problem. If the answer is "my friends, who are super supportive," you have an even bigger problem.&lt;/p&gt;

&lt;p&gt;What you need isn’t support. It’s structured scrutiny — from people (real or simulated) who are incentivized to find flaws, not to validate your illusions. Five perspectives conflicting with one another will yield more truth than one that simply agrees with you.&lt;/p&gt;

&lt;p&gt;The cost of evaluating an idea is an afternoon. The cost of building a bad idea is months of your life you’ll never get back. The math is clear.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related Reading:&lt;/strong&gt; If you're curious about how adversarial thinking applies to debugging opaque systems, check out &lt;a href="https://dev.to/posts/reverse-engineer-neural-network-senior-debugging/"&gt;A 2,500-Layer Neural Network That Turns Out to Be MD5&lt;/a&gt;. And if you want to see how the same council pattern applies to code reviews, read &lt;a href="https://dev.to/en/simplify-jedi-council-ai-code-review/"&gt;Simplify: A Jedi Council for Code Reviews with AI&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>startup</category>
      <category>mvp</category>
    </item>
    <item>
      <title>Git Worktrees: How to Have Multiple AI Agents Working Simultaneously Without Conflicts</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 16:01:34 +0000</pubDate>
      <link>https://dev.to/frr149/git-worktrees-how-to-have-multiple-ai-agents-working-simultaneously-without-conflicts-21ah</link>
      <guid>https://dev.to/frr149/git-worktrees-how-to-have-multiple-ai-agents-working-simultaneously-without-conflicts-21ah</guid>
      <description>&lt;h2&gt;
  
  
  The Single Checkout Bottleneck
&lt;/h2&gt;

&lt;p&gt;I'm developing a macOS menu bar app. I have three features in the backlog: a consumption sparkline, native notifications, and a desktop widget. All three are independent. I'm building all three with Claude Code.&lt;/p&gt;

&lt;p&gt;The problem: Claude Code works in one directory. One directory has one branch. And &lt;code&gt;git checkout&lt;/code&gt; is like a single-lane roundabout: only one branch gets through at a time.&lt;/p&gt;

&lt;p&gt;If I want to advance all three simultaneously, my classic options are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stash ping-pong&lt;/strong&gt;: &lt;code&gt;git stash&lt;/code&gt;, switch branches, work, &lt;code&gt;git stash pop&lt;/code&gt;, pray there are no conflicts. Repeat until madness or retirement, whichever comes first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clone the repo three times&lt;/strong&gt;: Works, but now I have three &lt;code&gt;.git/&lt;/code&gt; copies, three independent histories, and a &lt;code&gt;git fetch&lt;/code&gt; to do in each one. Wasteful.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accept serial life&lt;/strong&gt;: One feature after another. Safe, predictable, and slow as a hand-written merge sort.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are great. But there's a fourth option that's been in git since 2015 and almost nobody uses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Worktrees: The Solution You Already Had Installed
&lt;/h2&gt;

&lt;p&gt;A worktree is a second working directory that shares the same &lt;code&gt;.git&lt;/code&gt; repository. No copies, no clones, no black magic.&lt;/p&gt;

&lt;p&gt;The analogy: your repo is a library. Until now you had &lt;strong&gt;one desk&lt;/strong&gt; where you could only have one book open. A worktree is adding more desks. Each with a different book open, but all drawing from the same bookshelf.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/code/myapp/                    ← desk 1 (main)
     .git/                       ← the library (just one)

~/code/myapp-sparkline/          ← desk 2 (feature/sparkline)
     .git  ← file, not folder (pointer to library)

~/code/myapp-notifications/      ← desk 3 (feature/notifications)
     .git  ← another pointer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each directory is a complete checkout with all files. You can compile in one, run tests in another, and have your AI agent working in the third. Simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating One is a Single Line
&lt;/h2&gt;

&lt;p&gt;From your main repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git worktree add ../myapp-sparkline &lt;span class="nt"&gt;-b&lt;/span&gt; feature/sparkline
git worktree add ../myapp-notifications &lt;span class="nt"&gt;-b&lt;/span&gt; feature/notifications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done. Two new directories, each on its branch, sharing the entire git database. No cloning, no configuring remotes, no duplicating history.&lt;/p&gt;

&lt;h2&gt;
  
  
  What They Share and What They Don't
&lt;/h2&gt;

&lt;p&gt;This is important. Worktrees share &lt;strong&gt;the entire repo&lt;/strong&gt;: commits, branches, tags, remotes, hooks, configuration. If you commit in the sparkline worktree, you can see it immediately from the notifications one without doing &lt;code&gt;fetch&lt;/code&gt; or anything, because it's the same database.&lt;/p&gt;

&lt;p&gt;What they &lt;strong&gt;don't&lt;/strong&gt; share:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Files on disk (each desk has its working copy)&lt;/li&gt;
&lt;li&gt;The staging area (each has its own &lt;code&gt;git add&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The HEAD (each points to its branch)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simply put: the "what am I working on" state is private to each worktree. Everything else is shared.&lt;/p&gt;
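
&lt;p&gt;A quick way to see the shared database in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cd ~/code/myapp-sparkline
git commit -am "feat(sparkline): first pass"

# No fetch, no push: the other worktree already sees the commit
cd ~/code/myapp-notifications
git log feature/sparkline -1 --oneline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;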

&lt;h2&gt;
  
  
  The Workflow with Coding Agents
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. With worktrees, you can literally have multiple agents working in parallel on the same project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1: Claude Code on sparkline&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-sparkline
claude

&lt;span class="c"&gt;# Terminal 2: Claude Code on notifications&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-notifications
claude

&lt;span class="c"&gt;# Terminal 3: main intact, app running&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp
make run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Claude instance has its own directory, its own branch, its own &lt;code&gt;.build/&lt;/code&gt;. They don't step on each other. They don't compete for the index. They don't need to stash anything.&lt;/p&gt;

&lt;p&gt;And since they share the git database, when one agent finishes and pushes, the others already see that branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Merging: Exactly the Same as Always
&lt;/h2&gt;

&lt;p&gt;Worktrees don't change the merge workflow at all. They're normal branches in separate directories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Option A: local merge&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp
git merge feature/sparkline
git merge feature/notifications

&lt;span class="c"&gt;# Option B: PRs (usual approach)&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-sparkline
git push &lt;span class="nt"&gt;-u&lt;/span&gt; origin feature/sparkline
&lt;span class="c"&gt;# Create PR in GitHub/Gitea, review, merge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you're done, clean up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git worktree remove ../myapp-sparkline
git branch &lt;span class="nt"&gt;-d&lt;/span&gt; feature/sparkline  &lt;span class="c"&gt;# if already merged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Pitfalls Nobody Tells You About
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. One Branch, One Worktree
&lt;/h3&gt;

&lt;p&gt;You can't have &lt;code&gt;main&lt;/code&gt; checked out in two worktrees simultaneously. This is by design: it prevents two directories from modifying the same HEAD and corrupting each other. If you need a second checkout of &lt;code&gt;main&lt;/code&gt;, create a temporary branch.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The First Build is From Scratch
&lt;/h3&gt;

&lt;p&gt;Each worktree has its own build directory. The first compilation will be slow. After that, each worktree maintains its independent cache, which is precisely the advantage over classic &lt;code&gt;git checkout&lt;/code&gt; (which invalidates the cache every time you switch branches).&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Local Untracked Files
&lt;/h3&gt;

&lt;p&gt;Your &lt;code&gt;.env.local&lt;/code&gt;, editor configuration, and anything else git doesn’t track don’t get copied to the new worktree. You’ll need to recreate them or create symlinks.&lt;/p&gt;
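
&lt;p&gt;If sharing them is acceptable, a symlink back to the main checkout does the job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# One secrets file, visible from both worktrees
ln -s ~/code/myapp/.env.local ~/code/myapp-sparkline/.env.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;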

&lt;h3&gt;
  
  
  4. Apps with Shared Disk State
&lt;/h3&gt;

&lt;p&gt;If your app writes data to &lt;code&gt;~/Library/Application Support/&lt;/code&gt; or similar, two app instances from different worktrees will compete for the same file. This isn't a worktree problem, it's a problem of running two instances of the same app. Solution: don't run two simultaneously, or parameterize the data directory per build.&lt;/p&gt;
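
&lt;p&gt;A sketch of the second option, assuming your app reads its data directory from an environment variable (that variable is hypothetical; adapt it to your build):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Each worktree's instance gets its own state directory
MYAPP_DATA_DIR="$HOME/Library/Application Support/myapp-sparkline" make run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;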

&lt;h3&gt;
  
  
  5. Don't Delete the Directory Manually
&lt;/h3&gt;

&lt;p&gt;If you &lt;code&gt;rm -rf&lt;/code&gt; the worktree instead of using &lt;code&gt;git worktree remove&lt;/code&gt;, git still thinks the branch is occupied. Run &lt;code&gt;git worktree prune&lt;/code&gt; to clean up orphaned references.&lt;/p&gt;
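
&lt;p&gt;The recovery looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rm -rf ../myapp-sparkline   # the mistake
git worktree list           # the stale entry still shows up as prunable
git worktree prune          # drop the orphaned reference; the branch is free again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;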

&lt;h3&gt;
  
  
  6. The Remote Knows Nothing
&lt;/h3&gt;

&lt;p&gt;Worktrees are 100% local. Gitea, GitHub, GitLab... no remote knows they exist. They only see normal &lt;code&gt;git push&lt;/code&gt; commands with normal branches. It's like asking if your server has problems with you using Vim or VS Code: it doesn't know, it doesn't care.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Naming convention&lt;/strong&gt;: Put worktrees as siblings of the original repo, with a descriptive suffix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/code/myapp/                    ← main
~/code/myapp-sparkline/          ← feature
~/code/myapp-notifications/      ← feature
~/code/myapp-hotfix-login/       ← hotfix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way &lt;code&gt;ls ~/code/myapp*&lt;/code&gt; shows you everything at a glance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One worktree per feature, not per whim&lt;/strong&gt;: Create worktrees for work that will actually be parallel. If you're going to do things sequentially, a normal branch with &lt;code&gt;checkout&lt;/code&gt; is sufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean up when done&lt;/strong&gt;: Abandoned worktrees are like branches nobody deletes — they accumulate and confuse. &lt;code&gt;git worktree list&lt;/code&gt; is your friend.&lt;/p&gt;
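
&lt;p&gt;Typical output (paths and hashes are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ git worktree list
/Users/you/code/myapp                 a1b2c3d [main]
/Users/you/code/myapp-sparkline       9f8e7d6 [feature/sparkline]
/Users/you/code/myapp-notifications   3c4b5a6 [feature/notifications]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;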

&lt;p&gt;&lt;strong&gt;Don't edit the same file from two worktrees&lt;/strong&gt;: Technically you can, each has its copy. But if both modify the same file, you'll have conflicts when merging. Try to have features touch different areas of the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Complete Workflow Proposal
&lt;/h2&gt;

&lt;p&gt;For those who want an organized workflow, here's what I use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Create worktrees for sprint features&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp
git worktree add ../myapp-feat-a &lt;span class="nt"&gt;-b&lt;/span&gt; feature/feat-a
git worktree add ../myapp-feat-b &lt;span class="nt"&gt;-b&lt;/span&gt; feature/feat-b

&lt;span class="c"&gt;# 2. Launch an agent in each&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-feat-a &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; claude    &lt;span class="c"&gt;# terminal 1&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-feat-b &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; claude    &lt;span class="c"&gt;# terminal 2&lt;/span&gt;

&lt;span class="c"&gt;# 3. Merge as they finish&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/code/myapp-feat-a
git push &lt;span class="nt"&gt;-u&lt;/span&gt; origin feature/feat-a   &lt;span class="c"&gt;# create PR&lt;/span&gt;

&lt;span class="c"&gt;# 4. Clean up what's already merged&lt;/span&gt;
git worktree remove ../myapp-feat-a
git branch &lt;span class="nt"&gt;-d&lt;/span&gt; feature/feat-a

&lt;span class="c"&gt;# 5. See what's still active&lt;/span&gt;
git worktree list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cycle is: &lt;strong&gt;create → work in parallel → push/PR → merge → clean up&lt;/strong&gt;. Each worktree lives as long as the feature, no more, no less.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Worktrees have been in git since version 2.5 (July 2015). More than ten years. And most people still do &lt;code&gt;git stash&lt;/code&gt; like we're in 2010.&lt;/p&gt;

&lt;p&gt;With the arrival of coding agents, the bottleneck is no longer the speed at which you write code — it's the speed at which you can context switch. And worktrees eliminate that context switch completely: you don't switch branches, you switch directories. &lt;code&gt;cd&lt;/code&gt; instead of &lt;code&gt;checkout&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Which is, ultimately, what we should have been doing all along.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;TL;DR&lt;/strong&gt;: &lt;code&gt;git worktree add ../name -b branch&lt;/code&gt; creates a second working directory on the same repo. No copies, no stash, no invalidating caches. Perfect for having multiple coding agents working in parallel. Clean up with &lt;code&gt;git worktree remove&lt;/code&gt; when done.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This article was originally written in Spanish and translated with the help of AI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>git</category>
      <category>ai</category>
      <category>productivity</category>
      <category>tools</category>
    </item>
    <item>
      <title>I'm paying $15 per million tokens to write 'fix: typo'</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:59:32 +0000</pubDate>
      <link>https://dev.to/frr149/im-paying-15-per-million-tokens-to-write-fix-typo-3n64</link>
      <guid>https://dev.to/frr149/im-paying-15-per-million-tokens-to-write-fix-typo-3n64</guid>
      <description>&lt;p&gt;Yesterday I wrote a commit message with Claude Code. The diff was a one-line change: a typo in a comment. Claude Opus read the diff, thought for two seconds, and generated &lt;code&gt;fix: correct typo in auth comment&lt;/code&gt;. That consumed about 800 input tokens and 30 output tokens, at $15 and $75 per million respectively. Cost: about a cent and a half. But multiply that by 40 commits per day, 250 days per year, across a company with 200 developers using coding agents, and those pennies become thousands of dollars spent on the intellectual equivalent of applying band-aids.&lt;/p&gt;
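
&lt;p&gt;The back-of-envelope math, with &lt;code&gt;bc&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Per commit: 800 input tokens at $15/M, 30 output tokens at $75/M
echo '800*15/10^6 + 30*75/10^6' | bc -l    # 0.01425 dollars
# Per year: 40 commits/day x 250 days x 200 developers
echo '0.01425*40*250*200' | bc -l          # 28500 dollars
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;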

&lt;p&gt;The problem isn't that Opus is expensive. The problem is that coding agents don't distinguish between $0.001 tasks and $0.10 tasks. Everything goes through the same model. Generate a commit message, classify an issue, validate a format -- everything hits the big model at the same cost as designing a microservices architecture. It's the equivalent of hiring a surgeon to apply band-aids.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Let's run the numbers with Claude Opus 4 pricing (the previous generation, which most teams still use in production):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;th&gt;Output tokens&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Commit message (small diff)&lt;/td&gt;
&lt;td&gt;~800&lt;/td&gt;
&lt;td&gt;~30&lt;/td&gt;
&lt;td&gt;$0.014&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classify an issue&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;td&gt;~50&lt;/td&gt;
&lt;td&gt;$0.011&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validate commit format&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;td&gt;~20&lt;/td&gt;
&lt;td&gt;$0.006&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standup summary&lt;/td&gt;
&lt;td&gt;~2000&lt;/td&gt;
&lt;td&gt;~200&lt;/td&gt;
&lt;td&gt;$0.045&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these tasks need a model with 2 trillion parameters and multi-step reasoning capability. They're classification and constrained generation tasks. The equivalent of sorting cards by color.&lt;/p&gt;

&lt;p&gt;With Apple Intelligence's on-device model (3B parameters, included in macOS 26): cost $0.00, latency ~300ms, no network, no API key.&lt;/p&gt;

&lt;h2&gt;
  
  
  foundation-hooks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/frr/foundation-hooks" rel="noopener noreferrer"&gt;foundation-hooks&lt;/a&gt; is a set of 4 Swift binaries that use Apple's Foundation Models framework to automate development tasks that don't justify a cloud model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Binary&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Git hook&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fm-commit-msg&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generates conventional commit messages from diff&lt;/td&gt;
&lt;td&gt;&lt;code&gt;prepare-commit-msg&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fm-validate-msg&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Validates format and suggests corrections&lt;/td&gt;
&lt;td&gt;&lt;code&gt;commit-msg&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fm-lql-create&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Classifies and creates Linear issues via &lt;a href="https://github.com/frr/lql" rel="noopener noreferrer"&gt;lql&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fm-lql-standup&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generates standup summary from git log + issues&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All four share the same pattern: define a Swift struct with &lt;code&gt;@Generable&lt;/code&gt;, feed the model minimal context, get structured output in milliseconds.&lt;/p&gt;

&lt;p&gt;Installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/frr/foundation-hooks
&lt;span class="nb"&gt;cd &lt;/span&gt;foundation-hooks
make build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make install-hooks &lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/your/repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From that point on, every &lt;code&gt;git commit&lt;/code&gt; automatically generates a conventional message. The hook has been installed in 11 production repositories for two weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works: @Generable and constrained decoding
&lt;/h2&gt;

&lt;p&gt;This is the part that deserves technical attention. &lt;code&gt;@Generable&lt;/code&gt; isn't "ask the model to return JSON and hope for the best". It's &lt;strong&gt;constrained decoding&lt;/strong&gt; -- the model literally cannot generate tokens that violate the schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  The mechanism
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;@Generable&lt;/code&gt; is a Swift macro that generates a JSON Schema at compile time from the struct.&lt;/li&gt;
&lt;li&gt;The framework injects that schema into the prompt as a response format specification.&lt;/li&gt;
&lt;li&gt;During inference, at each decoding step, &lt;strong&gt;token masking&lt;/strong&gt; is applied: vocabulary tokens that would produce invalid output according to the schema are masked (probability 0 in the softmax).&lt;/li&gt;
&lt;li&gt;The model can only choose from valid tokens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Apple describes this as "guided generation" in the &lt;a href="https://developer.apple.com/videos/play/wwdc2025/301/" rel="noopener noreferrer"&gt;WWDC25 documentation&lt;/a&gt;. It's the same technique OpenAI uses with &lt;code&gt;response_format: json_schema&lt;/code&gt; and Anthropic applies in tool use. The difference: Apple integrates it into Swift's type system. Define the struct, the compiler generates the schema, the runtime applies it during inference. Type safety end-to-end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three levels of constraint
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@Generable&lt;/span&gt;
&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;CommitMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Level 1: HARD constraint — effective enum&lt;/span&gt;
    &lt;span class="c1"&gt;// Active token masking: only "fix", "feat", "refactor", etc.&lt;/span&gt;
    &lt;span class="c1"&gt;// Tokens that would form "bug" or "update" have probability 0.&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anyOf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"fix"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"feat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"refactor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"docs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"chore"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"style"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="c1"&gt;// Level 2: SOFT constraint — like a system prompt for this field&lt;/span&gt;
    &lt;span class="c1"&gt;// The model tends to follow it but isn't forced to.&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Scope of the change, e.g. auth, ui, db. One word, lowercase."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="c1"&gt;// Level 3: no constraint — free string, the model decides&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The analogy: &lt;code&gt;anyOf&lt;/code&gt; is a dropdown, &lt;code&gt;description&lt;/code&gt; is an input with placeholder, and a field without Guide is an empty textarea. The difference between the three isn't one of degree but of mechanism. The first operates at the token level (the model cannot deviate), the second operates at the prompt level (the model tends to follow it), the third has no guidance.&lt;/p&gt;

&lt;p&gt;This is relevant because the git hooks use case is exactly the scenario where hard constraints shine. A commit type must be one of 7 values. No ambiguity, no creativity, no reasoning. It's pure classification. A 3B parameter model with constrained decoding does this as well as a 200B model. The difference is one takes 300ms and is free, the other takes 2 seconds and costs money.&lt;/p&gt;

&lt;h2&gt;
  
  
  Complete code for a hook
&lt;/h2&gt;

&lt;p&gt;This is &lt;code&gt;fm-commit-msg&lt;/code&gt;, the &lt;code&gt;prepare-commit-msg&lt;/code&gt; hook. It's 106 lines of Swift with no external dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;Foundation&lt;/span&gt;
&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;FoundationModels&lt;/span&gt;

&lt;span class="kd"&gt;@Generable&lt;/span&gt;
&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;CommitMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Type of change"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anyOf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"fix"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"feat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"refactor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"docs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"chore"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"style"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Scope of the change, e.g. auth, ui, db, api. One word, lowercase."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Imperative summary of the change, max 50 chars, lowercase, no period"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="kt"&gt;SystemLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isAvailable&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// No Apple Intelligence — exit silently, user writes their own&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things worth highlighting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graceful degradation&lt;/strong&gt;: if Apple Intelligence isn't available (Mac without Apple Silicon, model not downloaded), the hook exits with code 0 and git continues normally. Never blocks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Doesn't fabricate&lt;/strong&gt;: the model receives &lt;code&gt;git diff --cached --stat&lt;/code&gt; and a patch truncated to 3000 characters. Enough to classify and summarize, insufficient to confabulate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Doesn't replace the human&lt;/strong&gt;: the message is written to the commit file with git comments (&lt;code&gt;#&lt;/code&gt;), so &lt;code&gt;git commit&lt;/code&gt; displays it in the editor. The user can modify or discard it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"""
    You generate git commit messages in conventional commits format.
    Focus on WHY the change was made, not WHAT changed.
    The subject must be imperative mood, lowercase, no period, max 50 chars.
    """&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;generating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CommitMessage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;): &lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;session.respond(to:generating:)&lt;/code&gt; returns a &lt;code&gt;CommitMessage&lt;/code&gt; instance, not a &lt;code&gt;String&lt;/code&gt;. No parsing. No regex. No &lt;code&gt;try? JSONDecoder().decode(...)&lt;/code&gt;. The struct is the contract and the compiler guarantees it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Issue tracking integration: fm-lql-create
&lt;/h2&gt;

&lt;p&gt;The same pattern works for issue tracking. &lt;code&gt;fm-lql-create&lt;/code&gt; classifies a natural language description and creates a Linear issue via &lt;a href="https://github.com/frr/lql" rel="noopener noreferrer"&gt;lql&lt;/a&gt;, a Linear CLI written in Rust:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@Generable&lt;/span&gt;
&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;IssueClassification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anyOf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"bug"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"feature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"improvement"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"chore"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anyOf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"urgent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"medium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"none"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Clean, professional issue title. Max 80 chars."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"One-line description for the issue body"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;fm-lql-create &lt;span class="s2"&gt;"auth token refresh crashes when expired"&lt;/span&gt;
PROD | high | bug | TOK: Auth: token refresh crashes on expiry
Token refresh fails silently when the OAuth token has expired, causing auth loop.

Press Enter to create, Ctrl-C to cancel:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The local model classifies the issue in ~500ms: type bug, priority high, clean title, one-line description. Then &lt;code&gt;lql create&lt;/code&gt; creates it in Linear. The &lt;code&gt;--dry-run&lt;/code&gt; flag shows the proposal without executing anything.&lt;/p&gt;

&lt;p&gt;Two fields with &lt;code&gt;anyOf&lt;/code&gt; (type, priority) guarantee the classification is valid. It cannot return "priority: very important" or "type: bugfix". The tokens are masked. Two fields with &lt;code&gt;description&lt;/code&gt; (title, description) give controlled freedom to the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before and after
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;With coding agent (Opus)&lt;/th&gt;
&lt;th&gt;With foundation-hooks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generate commit message&lt;/td&gt;
&lt;td&gt;~2s, ~800 tokens, ~$0.014&lt;/td&gt;
&lt;td&gt;~300ms, 0 tokens, $0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validate format&lt;/td&gt;
&lt;td&gt;~1.5s, ~300 tokens, ~$0.006&lt;/td&gt;
&lt;td&gt;~200ms, 0 tokens, $0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classify issue&lt;/td&gt;
&lt;td&gt;~2s, ~500 tokens, ~$0.011&lt;/td&gt;
&lt;td&gt;~500ms, 0 tokens, $0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generate standup&lt;/td&gt;
&lt;td&gt;~3s, ~2000 tokens, ~$0.045&lt;/td&gt;
&lt;td&gt;~800ms, 0 tokens, $0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requires network&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requires API key&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Works on airplane&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The local model times are actual measurements on a MacBook Pro M4 Pro. Not synthetic benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it can't do
&lt;/h2&gt;

&lt;p&gt;Apple's on-device model is a 3B parameter model with a 4096-token context window. It has clear limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Large diffs&lt;/strong&gt;: above ~3000 characters of patch, the context is truncated. For massive refactors touching 20 files, the model only sees the statistical summary (&lt;code&gt;--stat&lt;/code&gt;), not the complete patch. The commit message will be generic but correct in format.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Architectural decisions&lt;/strong&gt;: "Should I use a protocol or a concrete type here?" is a question that needs project context, codebase history, and multi-step reasoning. That's still big model territory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code generation&lt;/strong&gt;: foundation-hooks doesn't generate code. It generates metadata &lt;em&gt;about&lt;/em&gt; code: commit messages, classifications, summaries. The boundary is clear: if the task is to "write" something a human will review, use the big model. If the task is to "label" something a human already wrote, use the local model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;macOS 26+ with Apple Silicon only&lt;/strong&gt;: doesn't work on Linux, doesn't work on Intel Macs. For heterogeneous teams, the hook exits silently and the user writes their own message.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Prerequisites: macOS 26, Xcode 26, Apple Intelligence enabled&lt;/span&gt;
git clone https://github.com/frr/foundation-hooks
&lt;span class="nb"&gt;cd &lt;/span&gt;foundation-hooks
make build

&lt;span class="c"&gt;# Install hooks in a specific repo&lt;/span&gt;
make install-hooks &lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/your/repo

&lt;span class="c"&gt;# Install CLI binaries to ~/.local/bin&lt;/span&gt;
make install-lql

&lt;span class="c"&gt;# Install hooks in all known repos (edit Makefile to adjust the list)&lt;/span&gt;
make install-all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Makefile&lt;/code&gt; copies the compiled binaries directly to &lt;code&gt;.git/hooks/&lt;/code&gt;. No runtime, no daemon, no configuration. If the binary is in the hook, it works. And there’s always an escape hatch: &lt;code&gt;git commit --no-verify&lt;/code&gt; skips the &lt;code&gt;commit-msg&lt;/code&gt; validation, and the generated message is only a suggestion you can edit or delete in the editor.&lt;/p&gt;
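
&lt;p&gt;For reference, the install step amounts to something like this (the paths are assumptions based on Swift Package Manager defaults, not the actual Makefile):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Roughly what `make install-hooks REPO=...` does
cp .build/release/fm-commit-msg   "$REPO/.git/hooks/prepare-commit-msg"
cp .build/release/fm-validate-msg "$REPO/.git/hooks/commit-msg"
chmod +x "$REPO"/.git/hooks/prepare-commit-msg "$REPO"/.git/hooks/commit-msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;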

&lt;h2&gt;
  
  
  The thesis
&lt;/h2&gt;

&lt;p&gt;Coding agents are extraordinary tools for tasks requiring complex reasoning. But the current pricing model doesn't distinguish between complexity. Every interaction with the model -- from designing an architecture to writing "fix: typo" -- goes through the same pipeline, at the same cost, with the same latency.&lt;/p&gt;

&lt;p&gt;The solution isn't to stop using coding agents. It's to stop using them for everything. Classification, validation, and constrained generation tasks are solvable with a 3B parameter model running locally. The hardware is already in your machine. The framework is already in the operating system. Only the code to connect them was missing.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;foundation-hooks&lt;/code&gt; is 400 lines of Swift connecting those dots. &lt;code&gt;make install-hooks REPO=.&lt;/code&gt; and every commit generates its own message, every issue classifies itself, every standup writes itself in 800ms. No network, no tokens, no cost.&lt;/p&gt;

&lt;p&gt;The surgeon can stop applying band-aids.&lt;/p&gt;

</description>
      <category>swift</category>
      <category>appleintelligence</category>
      <category>git</category>
      <category>llm</category>
    </item>
    <item>
      <title>DIY Codex Automations: Nocturnal Agents with Claude Code and systemd</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:56:29 +0000</pubDate>
      <link>https://dev.to/frr149/diy-codex-automations-claude-code-systemd-kjm</link>
      <guid>https://dev.to/frr149/diy-codex-automations-claude-code-systemd-kjm</guid>
      <description>&lt;p&gt;Two weeks ago, OpenAI introduced &lt;em&gt;Codex Automations&lt;/em&gt;. The idea: define a trigger (a cron job, a push, a new issue), write instructions in natural language, and an agent runs it solo in an isolated &lt;em&gt;worktree&lt;/em&gt;. No human intervention. While you sleep, the agent triages issues, summarizes CI failures, generates &lt;em&gt;release briefs&lt;/em&gt;, and even improves its own instructions.&lt;/p&gt;

&lt;p&gt;Sounds like magic, right? And it is, a little. But there’s one catch they didn’t emphasize too much in the &lt;em&gt;keynote&lt;/em&gt;: you need the Codex App running on your desktop. macOS or Windows only. No &lt;em&gt;headless&lt;/em&gt; servers. No running it on a mini PC and forgetting about it.&lt;/p&gt;

&lt;p&gt;And that’s when I thought: “Wait. I already have this.”&lt;/p&gt;
&lt;h2&gt;
  
  
  The pieces you already have
&lt;/h2&gt;

&lt;p&gt;If you’re using Claude Code, you already have 90% of the infrastructure. &lt;code&gt;claude --print&lt;/code&gt; executes a prompt without an interactive session. You give it instructions; it gives you a result and shuts down. No GUI. No open terminal. Perfect for a &lt;em&gt;cron&lt;/em&gt; job.&lt;/p&gt;

&lt;p&gt;If you have a server that’s always on (a mini PC, Raspberry Pi, or a $5 VPS), you’ve got the scheduler. &lt;code&gt;systemd&lt;/code&gt; or &lt;code&gt;cron&lt;/code&gt;, whichever you prefer, has been working away in the background for decades while you sleep.&lt;/p&gt;

&lt;p&gt;And if you use Gitea, GitHub, or any forge with an API, you already have a place to deposit the results: comments on PRs, new issues, or committed files.&lt;/p&gt;

&lt;p&gt;Plainly put: &lt;em&gt;Codex Automations&lt;/em&gt; is a pattern. Not a product. And that pattern is old news.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│           systemd timer (every N hours)      │
│                     │                        │
│                     ▼                        │
│           bash/fish script                   │
│              │                               │
│              ├── git pull --ff-only           │
│              ├── claude --print "prompt"      │
│              ├── parse results                │
│              ├── notify (Telegram/email)      │
│              └── git push (if changes)        │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Anatomy of an Automation
&lt;/h2&gt;

&lt;p&gt;All automations follow the same structure. A script that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Updates the repo (&lt;code&gt;git pull&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Executes Claude Code in non-interactive mode&lt;/li&gt;
&lt;li&gt;Does something with the results&lt;/li&gt;
&lt;li&gt;Notifies and/or commits changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s build the first one. After that, the rest are just variations on the same theme.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Base Script
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;REPO_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/srv/social-publisher"&lt;/span&gt;
&lt;span class="nv"&gt;LOG_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/log/automations"&lt;/span&gt;
&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d-%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$REPO_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
git pull &lt;span class="nt"&gt;--ff-only&lt;/span&gt;

&lt;span class="nv"&gt;RESULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;claude &lt;span class="nt"&gt;--print&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; sonnet &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-turns&lt;/span&gt; 3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# The prompt comes as an argument&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOG_DIR&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$TIMESTAMP&lt;/span&gt;&lt;span class="s2"&gt;.md"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. The skeleton fits into 12 lines. The rest is about deciding which prompt to pass and what to do with &lt;code&gt;$RESULT&lt;/code&gt;.&lt;/p&gt;
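
&lt;p&gt;A hypothetical invocation, assuming you saved it as &lt;code&gt;/opt/automations/run.sh&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The prompt is just the first argument; the result lands in /var/log/automations
/opt/automations/run.sh "List the TODO and FIXME comments in this repo as a markdown checklist"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;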

&lt;h3&gt;
  
  
  The systemd Timer
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/claude-automation.timer
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Claude Code automation&lt;/span&gt;

&lt;span class="nn"&gt;[Timer]&lt;/span&gt;
&lt;span class="py"&gt;OnCalendar&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;*-*-* 03:00:00&lt;/span&gt;
&lt;span class="py"&gt;Persistent&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;timers.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/claude-automation.service
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Claude Code automation runner&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;oneshot&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;claude-runner&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/opt/automations/review-prs.sh&lt;/span&gt;
&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY=&amp;lt;your-key&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; claude-automation.timer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 3 a.m., &lt;code&gt;systemd&lt;/code&gt; kicks off the script. Claude analyzes whatever you ask it to and deposits the result. You find out in the morning.&lt;/p&gt;
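
&lt;p&gt;Two standard &lt;code&gt;systemd&lt;/code&gt; checks to confirm the timer is armed and see what happened overnight:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl list-timers claude-automation.timer   # shows the next scheduled run
journalctl -u claude-automation.service -n 50   # output from the last run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;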

&lt;h2&gt;
  
  
  Example 1: Automatic PR Review
&lt;/h2&gt;

&lt;p&gt;This is the most useful one. Every time there’s an open PR, Claude reviews it and leaves a comment.&lt;/p&gt;

&lt;p&gt;Using a webhook is more elegant, but a cron job every 30 minutes works just as well for small teams:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;GITEA_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://git.example.com"&lt;/span&gt;
&lt;span class="nv"&gt;GITEA_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;op &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="s1"&gt;'op://DEV/Gitea/token'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"myorg/myrepo"&lt;/span&gt;

&lt;span class="c"&gt;# Get open PRs&lt;/span&gt;
&lt;span class="nv"&gt;PRS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: token &lt;/span&gt;&lt;span class="nv"&gt;$GITEA_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITEA_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/v1/repos/&lt;/span&gt;&lt;span class="nv"&gt;$REPO&lt;/span&gt;&lt;span class="s2"&gt;/pulls?state=open"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.[].number'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;PR_NUM &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$PRS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="c"&gt;# Get the diff&lt;/span&gt;
  &lt;span class="nv"&gt;DIFF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: token &lt;/span&gt;&lt;span class="nv"&gt;$GITEA_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITEA_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/v1/repos/&lt;/span&gt;&lt;span class="nv"&gt;$REPO&lt;/span&gt;&lt;span class="s2"&gt;/pulls/&lt;/span&gt;&lt;span class="nv"&gt;$PR_NUM&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/diff"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

  &lt;span class="c"&gt;# Claude reviews the diff&lt;/span&gt;
  &lt;span class="nv"&gt;REVIEW&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;claude &lt;span class="nt"&gt;--print&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model&lt;/span&gt; sonnet &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--max-turns&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"Review this PR diff. Flag potential bugs, &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
     security issues, and specific areas for improvement. Be concise. &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
     Do not repeat the code; highlight issues with their line.

     &lt;/span&gt;&lt;span class="nv"&gt;$DIFF&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

  &lt;span class="c"&gt;# Post as a comment&lt;/span&gt;
  curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: token &lt;/span&gt;&lt;span class="nv"&gt;$GITEA_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITEA_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/v1/repos/&lt;/span&gt;&lt;span class="nv"&gt;$REPO&lt;/span&gt;&lt;span class="s2"&gt;/pulls/&lt;/span&gt;&lt;span class="nv"&gt;$PR_NUM&lt;/span&gt;&lt;span class="s2"&gt;/comments"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;body&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;## Automated Review&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;n&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;n&lt;/span&gt;&lt;span class="nv"&gt;$REVIEW&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each morning when you open Gitea, every PR has a comment with feedback. It doesn’t replace a human review, but it filters out the obvious: typos, unused imports, an &lt;code&gt;if&lt;/code&gt; without an &lt;code&gt;else&lt;/code&gt; that smells like a bug.&lt;/p&gt;
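
&lt;p&gt;Before scheduling it, you can smoke-test the prompt on a local branch by piping a diff into &lt;code&gt;claude --print&lt;/code&gt; (the branch name here is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git diff main...my-feature | claude --print --model sonnet --max-turns 1 \
  "Review this PR diff. Flag potential bugs, security issues, and specific areas for improvement. Be concise."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;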




&lt;p&gt;[rest of the examples and entire blog follow translated...]&lt;/p&gt;

</description>
      <category>automation</category>
      <category>claude</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>In Codex, a Skill Is Not a /Command (but in Claude Code, It Almost Is)</title>
      <dc:creator>Fernando Rodriguez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:54:27 +0000</pubDate>
      <link>https://dev.to/frr149/in-codex-a-skill-is-not-a-command-but-in-claude-code-it-almost-is-1pi4</link>
      <guid>https://dev.to/frr149/in-codex-a-skill-is-not-a-command-but-in-claude-code-it-almost-is-1pi4</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; If you're using Codex, use a &lt;strong&gt;command&lt;/strong&gt; to control the session or application, and use a &lt;strong&gt;skill&lt;/strong&gt; to teach the agent a way of working. In Claude Code, the current documentation already treats &lt;em&gt;skills&lt;/em&gt; as something you can invoke with &lt;code&gt;/skill-name&lt;/code&gt;, so the concepts merge more there. Not so in Codex: &lt;code&gt;types&lt;/code&gt; might exist as a skill, but &lt;code&gt;/types&lt;/code&gt; won't exist by default.&lt;/p&gt;




&lt;p&gt;There's a common confusion when switching from Claude Code to Codex. And it's understandable.&lt;/p&gt;

&lt;p&gt;You create a &lt;em&gt;skill&lt;/em&gt; called &lt;code&gt;types&lt;/code&gt;, go back to the terminal, type &lt;code&gt;/types&lt;/code&gt; all confident... and Codex looks at you like you just walked into a hardware store and ordered a latte.&lt;/p&gt;

&lt;p&gt;The problem isn't that your skill is broken. The problem is that in Codex, &lt;strong&gt;a skill and a command are not the same thing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And here's the kicker: this distinction is not just cosmetic. It changes how you design your workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Analogy to Make It Clear
&lt;/h2&gt;

&lt;p&gt;Think of Codex as a plane with two levels.&lt;/p&gt;

&lt;p&gt;The first level is the &lt;strong&gt;cockpit&lt;/strong&gt;: buttons, levers, indicators. That's where commands live. They control the session, the client, or the tool. It's operational control.&lt;/p&gt;

&lt;p&gt;The second level is the &lt;strong&gt;copilot's manual&lt;/strong&gt;: procedures, guidelines, checklists, avoidable pitfalls. That's where skills live. They change &lt;strong&gt;how the agent thinks&lt;/strong&gt; when performing a task.&lt;/p&gt;

&lt;p&gt;Put simply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;command&lt;/strong&gt; affects the cockpit.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;skill&lt;/strong&gt; affects the copilot's head.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you try to use the manual as if it were a button, that doesn’t fly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a Command in Codex?
&lt;/h2&gt;

&lt;p&gt;In Codex, commands come in two flavors that shouldn’t be mixed up.&lt;/p&gt;

&lt;p&gt;The first type is &lt;strong&gt;CLI commands&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codex login
codex &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="s2"&gt;"run tests and fix failures"&lt;/span&gt;
codex resume &lt;span class="nt"&gt;--last&lt;/span&gt;
codex apply
&lt;span class="nt"&gt;---&lt;/span&gt;

No mystery here. These are application operations. Authenticating, running a task, resuming a session, applying a diff. If you removed the model tomorrow, these commands would still make sense.

The second &lt;span class="nb"&gt;type &lt;/span&gt;is &lt;span class="k"&gt;**&lt;/span&gt;slash commands &lt;span class="k"&gt;in &lt;/span&gt;an interactive session&lt;span class="k"&gt;**&lt;/span&gt;:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/model
/permissions
/personality
/agent
/status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
These aren’t "fancy prompts" either. They’re live session controls. They change the model, permissions, personality, active thread, or visible state. They’re cockpit buttons.

OpenAI, in fact, documents it this clearly: there's one dedicated page for **slash commands** "to control Codex during interactive sessions," and another distinct page for **skills**, defining them as the authoring format for *reusable workflows*.

That's why these are commands and not skills: they require predictable, immediate behavior with stable semantics. You don't want the model to "creatively interpret" what `/permissions` means. You want it to change permissions. Period.

## What Is a Skill in Codex?

A skill in Codex is something else entirely. It’s a reusable workflow that teaches the agent **when** to apply an approach, **how** to think about a task, and **which steps** to follow.

And here’s another fine but important nuance: OpenAI says a skill is the authoring format, whereas the **plugin** is the installable or distributable unit. In other words, you first design the workflow as a skill; if you want to share or package it later, you wrap it up.

Clear examples:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$types
$improve
$owasp
$blog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Or, if you prefer natural language:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use types to audit this repo
use improve to review this diff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Here, you're not telling Codex, "Change a setting." You're saying, "When you do this task, follow this playbook."

For example, my `types` skill shouldn’t be a button. It needs to read the project, detect the language, inspect models, look for stringly-typed code, decide if an `Optional` is being used correctly or if it's modeling a domain state. That requires context and judgment. That’s exactly the type of work a skill is designed to handle.

For the same reason, `improve` makes sense as a skill: reviewing a diff isn’t a deterministic action. It’s a specific way to approach code review.

## Why It Feels Like “The Same Thing” in Claude Code

Here's the mental trap.

The current Claude Code documentation isn’t shy about this. It talks about **skills** and tells you that you can invoke them directly with:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/skill-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
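
&lt;p&gt;For context, defining one of those skills is just a folder plus a &lt;code&gt;SKILL.md&lt;/code&gt;. A minimal sketch, assuming the documented Agent Skills layout (check your version's docs for the exact fields; the skill body here is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical 'types' skill; the frontmatter description is what
# lets the agent decide on its own when to pull the skill in
mkdir -p ~/.claude/skills/types
cat &gt; ~/.claude/skills/types/SKILL.md &lt;&lt;'EOF'
---
name: types
description: Audit a codebase's type design; use when reviewing models or APIs
---

Read the project, detect the language, and flag stringly-typed code.
Check whether each Optional models a real domain state.
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;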

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
In Claude Code, a significant part of what you perceive as a "reusable workflow" enters through a slash command syntax. The UX blends two concepts that are separate in Codex:

- Reusing a workflow
- Invoking it with `/something`

Additionally, Claude Code retains its **built-in commands** separately:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/help
/compact
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
And it even separates yet another piece: **subagents**, which are specialized assistants with their own context, permissions, and system prompt.

In other words:

- In **Claude Code**, skills, subagents, and commands coexist, but skills can be invoked with `/`.
- In **Codex**, reusable workflows live as skills, and `/commands` are reserved for explicit session control.

That’s why, coming from Claude, your brain quickly learns a practical equivalence: "If something reusable exists, I’ll probably trigger it with `/something`." In Codex, this mental shortcut stops working.

## Concrete Examples: What Should Be a Skill vs. a Command?

### Things That Should Be a Skill in Codex

**`types`**

Because you’re not “triggering an action.” You want to apply type design principles on a real codebase.

**`improve`**

Because reviewing a diff isn’t a mechanical operation. It involves judgment, context, and priorities.

**`blog`**

Because writing an article with tone, structure, and fact-checking is a reasoning flow, not a button.

**`owasp`**

Because a security audit needs to adapt heuristics to the stack, repo, and specific risks.

### Things That Should Be a Command in Codex

**`codex login`**

There’s nothing to reason about. You either authenticate or you don’t.

**`/model`**

Switching models is a client operation. Not a work criterion.

**`/permissions`**

Tweaking permissions mid-session is pure operational control.

**`codex resume --last`**

Reopening a session isn’t cognitive workflow. It’s an app action.

## The Trickiest Case: Hybrid Tasks

There’s an intermediate category that can trip you up at first: workflows you’d like to launch with a convenient syntax, but whose logic is still skill-based.

For example:

- You’d like to write `/types`
- But conceptually, `types` is still a skill

The elegant solution here isn’t "turn the skill into something else." The solution is to wrap it.

That means:

1. Keep the intelligence in the skill.
2. Create a plugin or command to invoke it with slash-command ergonomics.

This way, you get the best of both worlds: command UX, skill brains.
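
&lt;p&gt;A minimal sketch of that wrapping, assuming Codex picks up custom prompts as slash commands from &lt;code&gt;~/.codex/prompts&lt;/code&gt; (the file name and wording here are hypothetical; check your version's docs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The prompt file is the button; the skill keeps the brains
mkdir -p ~/.codex/prompts
cat &gt; ~/.codex/prompts/types.md &lt;&lt;'EOF'
Use the $types skill to audit the type design of the current repository.
EOF
# From now on, typing /types in an interactive session expands to the line above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;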

## The Golden Rule in Codex

When deciding between a command and a skill, use this test:

**Do you want to change the session or app state?**

Then you need a **command**.

**Do you want to change how the agent approaches a task?**

Then you need a **skill**.

Here’s a handy table:

| I want to...                     | Use in Codex... | Example             |
|----------------------------------|-----------------|---------------------|
| change permissions               | `command`       | `/permissions`      |
| switch models                    | `command`       | `/model`            |
| resume a session                 | `command`       | `codex resume --last` |
| apply an auditing criterion      | `skill`         | `$types`            |
| review a diff with a methodology | `skill`         | `$improve`          |
| draft with an editorial guide    | `skill`         | `$blog`             |

## So, Which Should You Use?

The short answer: **in Codex, use skills for reusable knowledge and commands for operational control**.

If you’re coming from Claude Code, your first instinct will be to turn every reusable workflow into `/something`. That’s an understandable habit because Claude's documentation encourages that thinking. But in Codex, that habit will get you stuck fast.

First, design the **skill**. If you later need more ergonomic input, wrap it in a plugin or command. Not the other way around.

Because if you start with the button before you’re clear on the procedure, you’ll end up with a pretty interface that doesn’t do much. And we’ve already got too many of those in this industry.

Here’s the takeaway: in Claude Code, a *skill* can come through the `/slash-command` door. In Codex, it can’t. And honestly, that’s probably a good thing.

Once you understand this difference, you’ll stop fighting with `/types` and start building workflows that actually fit the tool. Progress!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>codex</category>
      <category>claudecode</category>
      <category>skills</category>
      <category>cli</category>
    </item>
  </channel>
</rss>
