DEV Community

Errata Hunter

Posted on • Originally published at reversetobuild.com

How to Actually Use AI Coding Agents — 6 Skill-Specific Tips

1. Where Vibe Coding Breaks

Last year I was adding one more BLE sensor node to a Zephyr-based nRF52 firmware. I threw a one-liner at Claude Code — "enable the SPI driver" — and a single clean line landed in prj.conf: CONFIG_SPI_NRFX_SPIM3=y. The build passed. The binary flashed. The board would not boot. Thirty minutes of digging later, it hit me — that symbol does not exist anywhere in Zephyr. The correct answer on this chip family is CONFIG_SPI_NRFX_SPIM plus a Devicetree node activation. The symbol the AI had synthesized was silently dropped by the Kconfig parser with a single "unknown symbol, ignoring" warning, buried somewhere in 800 lines of build log.

Simon Willison wrote in March 2025 that "hallucinations in code are the least dangerous form of LLM mistakes." The reasoning is clean — you run the code and the error yells at you. Call a method that doesn't exist and the stack trace shouts, then paste it back into the agent. Done. Willison made this as a general claim, not restricted to a language or domain. It holds on the web. It holds in a Python REPL. It did not hold in my firmware. The moment the assumption "run it and the error surfaces" collapses, the entire run-and-detect feedback loop loses its meaning. The compile passed, the build passed, the binary flashed, and the board quietly turned into a brick. Willison's optimism did not protect me there.

I did not want to file this under "Claude Code isn't smart enough yet." I tried the same prompt against GPT-5 and Gemini and got similar results. The problem was not AI quality but where I had placed the AI in my process. I was expecting "verified output" from a generation stage. Generation is the stage where hallucination is natural; verification has to happen somewhere else. The empty seat was not the AI's to fill — it was mine.

Through my time using AI coding agents, I translated that lesson from code into the shape of a pipeline. Pieces I'd built at different moments — a Kconfig verification hook, a gate-based workflow, an HIL CI feedback loop — only in hindsight did I see they were all answering the same question: at which stage, with what verification, do I hand work to the AI? This essay is my current answer. Six skills, six gates, and the failures and trade-offs I hit at each seat.

2. Six Skills as a Frame — Gates, Not a Loop

Many AI coding guides talk about a "loop" — research, plan, implement, review, back to research. Circles are pretty but they did not match my experience. Circles have nothing to pass through. I started seeing the process as gates instead. Each stage has a pass condition — "don't verify this and the next stage gets poisoned" — and the mechanism that holds that condition is its own thing.

Here is the shape of my pipeline.

[Diagram: an AI coding pipeline connecting six skill gates from Research to Review, with Fact-Check inserted twice — after Research and after Plan. The full shape: Research → Fact-Check → Plan → Fact-Check → Implement → Debug → Review. Why Fact-Check sits twice is explained below.]

The odd part here is that Fact-Check appears twice — once right after Research, once right after Plan. I also thought "once is enough" at first. Then I ran into a pattern several times: the research was collected cleanly, but the assumptions the AI added during planning were wrong. Implicit premises like "this library supports that platform" or "this API already exists in v2.4." These were not facts from research but new claims the planner introduced, and they needed a separate teardown.

Addy Osmani's 2026 workflow has five stages. The Claude Code docs use four: Explore → Plan → Code → Commit. Cursor's best practices — interestingly — barely use the word "hallucination" in the body and instead say "AI-generated code can look right while being subtly wrong." All three see the same phenomenon. The difference is the number of gates, and the number of gates scales with the feedback latency of the domain. Web and scripting get run-to-error feedback in seconds, so two or three gates are enough. Firmware sits with tens of minutes between compile and boot, and weeks between boot and "no intermittent bug." You need more gates — and not just more, but different kinds.

That is why I call them gates rather than loops. A loop is a question of how many times you go around; a gate is a question of where you put what. The latter is system design, the former is operational feel. This essay sits on the system-design side.

What the AI is good at and bad at also shifts per stage. At Research the AI is an excellent "keyword expander" and a terrible fact checker. At Debug it flips — the AI is an excellent log reader, and here I actually get better results by stepping back. Splitting the work into six skills is how I avoid losing that role inversion. Lump them together and everything collapses into "the AI just isn't great."

3. Skill 1 — Research: Excellent Assistant, Terrible Fact Checker

At the research stage the AI is especially good at three things. Keyword expansion (say "BLE 5.3 periodic advertising" and fifteen adjacent terms come out). Comparison tables (current draw, RX sensitivity, BOM cost across chip A/B/C). Document summarization (the two paragraphs I want from a 60-page datasheet). Use those three well and you research two to three times faster than alone.

The trouble starts right after. The AI cannot judge source credibility, cannot guarantee recency, and cannot verify domain-specific accuracy. I once trusted an AI summary over the datasheet and reversed a register bit order — a bit that had flipped between chip revisions A and B. The AI confidently served the revision-A answer. Half a day gone to debugging. The problem was not that the summary was wrong; the problem was that I had not built a way to check whether the summary was wrong.

So my Research stage now carries three hard-coded rules. First, source tagging. Every entry in the findings file gets labeled as [official], [community], or [AI inference]. That one-word tag decides the "what to doubt first" order at the Fact-Check stage. Second, concrete query design. "Find me BLE OTA docs" is a bad prompt; "official docs, release notes, and ncs-* tag commit messages for Nordic nRF Connect SDK 2.5's MCUboot swap algorithm" is a good one. The latter forces the AI to choose where to look. Third, persistence to .md. Research output always accumulates in one findings.md. Sessions can drop, context can compact — the information survives and flows cleanly into the next stage.

If you freeze it as a file — if you decide to embed the Research stage as a skill, pin four things into the skill text. ① An interface that takes "request (topic, scope)" as an argument. ② An output that appends to a single markdown file rather than overwriting. ③ A directive at the top: "analyze deeply and record the details thoroughly" (that single sentence roughly doubles or triples the perceived summary depth). ④ The rule that matters most — do not modify any file other than the one this skill writes. Miss the fourth and the day comes when the AI says "while I was at it, I also fixed main.c." Unverified edits slip into the research stage, and Fact-Check ends up breaking already-polluted input. Practically, scope tool permissions to a single write path: allowed-tools: Read, Grep, Glob, WebFetch, Write(findings.md). The Claude Code skills docs recommend exactly this shape.
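As a concrete sketch, the four pins plus the scoped write path could be frozen like this. The frontmatter field names (name, description, allowed-tools) follow the Claude Code skills docs; the skill name, paths, and wording are my own assumptions:

```shell
# Hypothetical scaffold for the Research skill file. Frontmatter field names
# follow the Claude Code skills docs; everything else is illustrative.
mkdir -p .claude/skills/research
cat > .claude/skills/research/SKILL.md <<'EOF'
---
name: research
description: Research a topic and append tagged findings to findings.md
allowed-tools: Read, Grep, Glob, WebFetch, Write(findings.md)
---
Input: a request (topic, scope).
Analyze deeply and record the details thoroughly.
Append to findings.md; never overwrite it, and never modify any other file.
Tag every entry as [official], [community], or [AI inference].
EOF
```

The single `Write(findings.md)` entry is the structural version of pin ④: the skill text asks, the tool scope enforces.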

4. Skill 2 — Fact-Check: Break the Plan Before You Ship It

Fact-Check will be the strangest-sounding section here. Most guides do not place a dedicated verification stage between research and implementation. I place two. One right after research, one right after planning. To explain why, I need to start with a pattern I kept hitting.

If you have ever told an AI, inside the same session, "find what's wrong with what you just researched," you know how subtly disappointing the result is. The AI leans toward confirming its own answer. It will fix typos and small wording, but the big claims — "this chip supports that feature" — usually survive. I first blamed the model. Then I saw Anthropic's automated red teaming work from 2024 and changed my mind. One model generates attacks and a different model defends. The match does not exist inside a single model, a single session. The industry had already converged on "only verification in an independent session counts." Addy Osmani calls this "secondary AI sessions to critique primary outputs." The Claude Code docs recommend a Writer/Reviewer pattern and explain it in one line: "A fresh context improves code review since Claude won't be biased toward code it just wrote." That bias is exactly the subtle disappointment I kept feeling.

So my Fact-Check skill does four things.

First, it forces an independent session. The researching session and the fact-checking session do not share context. The skill takes only a file path as input and reads from there as if seeing the document for the first time. I deliberately build the setup of handing a paper to someone who does not know the answer.

Second, it hunts "things that don't exist" deterministically. In my experience this is the most dangerous family of hallucinations. The principle behind the Kconfig-symbol verification hook is simple — extract every symbol mentioned in the research or plan and grep the actual Kconfig tree to confirm each one. Present → pass; absent → stamp [TBD: needs fact-check] and report. What matters is that it is a deterministic file-existence check, not a probabilistic AI judgment. You wrap nondeterministic generation in deterministic verification — that is exactly what the word "gate" means here.

Recently an academic version of the same idea appeared. arXiv 2509.09970 validates GPT-4-generated FreeRTOS firmware in QEMU, categorizes faults into buffer overflow (CWE-120), race condition (CWE-362), and DoS (CWE-400), runs fuzzing, static analysis, and runtime checks through a three-stage agent loop, and reports a 92.4% Vulnerability Remediation Rate and a 37.3% improvement margin. The numbers are tied to that paper's sample, but the design principle — close generation's nondeterminism with verification's determinism — is the same as my hook's.
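A minimal sketch of that deterministic check in shell, assuming a plan file and a Kconfig tree as inputs (layout and naming are illustrative, not the actual hook):

```shell
# Deterministic Kconfig-symbol gate (sketch). Extract every CONFIG_* symbol
# mentioned in a plan or findings file, then confirm each one is actually
# declared somewhere in the Kconfig tree.
check_kconfig_symbols() {
  plan="$1"
  kconfig_root="$2"
  grep -ohE 'CONFIG_[A-Z0-9_]+' "$plan" | sort -u | while read -r sym; do
    # Kconfig declares symbols without the CONFIG_ prefix: "config FOO"
    name="${sym#CONFIG_}"
    if grep -rqE "^(menu)?config[[:space:]]+$name[[:space:]]*$" "$kconfig_root"; then
      echo "OK   $sym"
    else
      echo "MISS $sym  [TBD: needs fact-check]"
    fi
  done
}
```

Because the check is a file grep rather than a model call, a MISS is a hard stop, never a judgment call.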

Third, it embeds a Red Team prompt. The second Fact-Check, right after planning, centers on logical weaknesses rather than cross-referencing official docs. I pin a single line into the skill:

```
You are a senior engineer. Find the three weakest links in the plan below,
and for each one describe a concrete failure scenario and the moment it fails.
"It'll probably be fine" counts as one of the three failures.
```

The last line matters more than it looks. Without pinning the Red Team role, the AI wraps up with "a mostly solid plan."

Fourth, it never modifies the source. Fact-Check output does not touch findings.md or the plan file; it writes to a separate report file. The moment a verifier edits the verification target, the verifier becomes a new source of contamination. This has to be enforced by structure, not by discipline — in Claude Code, scope allowed-tools to Read, Grep, WebFetch, Write(fact-check-report.md). Write permission opens for the report file only.

If you freeze it as a file — input is the path of the document to verify, output is a single report file. The four items to pin are exactly the four paragraphs above. One addition: Fact-Check is a skill that should have no side effects, so setting disable-model-invocation: true and only running it on explicit invocation is the safer default. The Claude Code skill system exposes that flag for exactly this use case.

5. Skill 3 — Plan: Draft It Twice, Break It Once

Plan Mode is not universal. The Claude Code docs admit this plainly: "If you could describe the diff in one sentence, skip the plan." Turning on Plan Mode for a one-sentence diff costs more than it returns. My Plan stage does not run every time — it runs when I feel "my mental model and the AI's mental model might be misaligned before I touch code." The heuristic, from experience, is roughly: if more than two files are affected, or if I am touching a system I know less well, I always run Plan.

When Plan runs, I draft twice. The first draft is the AI's; the second is the AI revising based on my annotations. At least two rounds. This is closest to Osmani's "waterfall in 15 minutes" analogy, and what matters is that the two rounds serve different goals. Round one enumerates the full list of steps. Round two attaches the trade-offs round one missed. It is not doing the same thing twice.

Round two carries one extra prompt — "where will this plan fail first?" It is the same family as the Red Team in Fact-Check #2, but the timing differs. Throwing one self-destructive question before the plan hardens turns the answer into a "risks" section that then acts as a warning light throughout implementation. Remember that the reader of the plan is future-you, and the plan's usefulness goes up.

I pin four required elements to every plan. ① Approach detail — why this order, why not another. ② Before/after code snippets — not "refactor this part," but the actual shape of the change. ③ Exact file paths — no "modify the relevant files." ④ Explicit trade-offs — chosen approach, alternatives, reasons for rejection. Miss these four and the plan does not reach the level of "Claude can implement this right now." Most of the mid-implementation "what should I do here?" interruptions come from vague plans.

If you freeze it as a file — input is the path of a research file (or findings.md), output is a single plan document with checkboxes (- [ ]). The five things to pin into the skill: ① read the research file first (without this the AI plans from memory), ② justify the chosen order, ③ before/after code snippets, ④ exact file paths, ⑤ trade-offs. For items four and five, leaving empty slots in the skill's example template builds the habit of filling them in.
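One way to build that fill-in habit is to scaffold the template with the slots already present. This sketch assumes a plan.md output file and section headings of my own choosing:

```shell
# Hypothetical plan-template scaffold with the five pins as empty slots.
# The file name plan.md and the headings are illustrative.
cat > plan.md <<'EOF'
# Plan: <topic>

## Order and rationale
<why this order, and why not another>

## Steps
- [ ] Step 1 (exact file path: <path>)
      Before: <snippet>
      After: <snippet>
- [ ] Step 2 (exact file path: <path>)

## Trade-offs
- Chosen approach:
- Alternatives considered:
- Reasons for rejection:

## Risks
<answer to: where will this plan fail first?>
EOF
```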

6. Skill 4 — Implement: Structured Context Cuts Hallucinations

The one thing I learned at Implement is the equation context = answer key. The AI's probability of producing the right answer is far more sensitive to the quality of the context you attach than to the "instructions" in your prompt. When I had the AI write a Devicetree overlay, I bundled the board file, the target node's DTS, and the binding YAML together via @ references — and a task that had failed three times passed on the first try. I recently saw the same observation independently at reversetobuild.com. That blog recommends the exact same pattern: inject @boards/arm/nrf52840dk_nrf52840.dts, @zephyr/dts/arm/nordic/nrf52840.dtsi, and @zephyr/dts/bindings/spi/spi-device.yaml together. Reaching the same conclusion independently tells me this is not a personal trick but a structural requirement of the domain.

Opposite structured context sits incremental implementation. Generating 200 lines at once is almost always worse than generating 20 lines ten times. The reason is simple — every 20 lines the build runs, the type checker runs, the linter runs. Errors surface immediately, and the AI generates the next 20 lines in a slightly different context. Generate 200 at once and a single error pulls the whole block down, and the AI loses context about where to start fixing. "Short and frequent" is not just a commit principle; it is a generation principle.

And the most practical rule — no type escapes. TypeScript's any and unknown, Python's Any, Go's interface{}, C/C++'s void*. These are the first exits the AI takes when stuck. When the AI plasters over a spot with any "just to make it run," that spot is exactly where the runtime error lands a few weeks later. My Implement skill text bans these explicitly, but the enforcement is a hook, not a skill. A PostToolUse hook called auto-typecheck.sh runs tsc --noEmit or mypy right after a file edit and, on a type error or a reintroduced any, blocks the tool call itself with exit 2. Skill text is persuasion; the hook is the contract. Do not mix the two.
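A minimal sketch of that hook. I pass the edited file path as a plain argument to keep the sketch simple; a real Claude Code hook receives a JSON payload on stdin and would read the path from tool_input.file_path:

```shell
# Sketch of an auto-typecheck PostToolUse hook (file path as $1 here;
# real Claude Code hooks deliver JSON on stdin, so adapt accordingly).
auto_typecheck() {
  file="$1"
  case "$file" in
    *.ts|*.tsx)
      # Hard-block reintroduced type escapes like ": any" or "as any".
      if grep -qE '(:[[:space:]]*|as[[:space:]]+)any($|[^A-Za-z0-9_])' "$file"; then
        echo "type escape 'any' found in $file" >&2
        return 2   # exit status 2 is what blocks the tool call
      fi
      # Run the real type checker when the project has one configured.
      if command -v tsc >/dev/null 2>&1 && [ -f "$(dirname "$file")/tsconfig.json" ]; then
        tsc --noEmit || return 2
      fi
      ;;
  esac
  return 0
}
```

The grep is the cheap fast gate; tsc is the slow thorough one. Both report through the same exit-2 contract.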

Security code is the exception. Cryptography, authentication, signing, key management — I do not hand these to the AI. You might ask "can't you just review it?" I'll come back to that in the Review section. The short version: the distribution of bugs an AI reviewer misses overlaps heavily with the distribution of security bugs. So I remove them from the generation stage entirely.

If you freeze it as a file — input is a plan file path, output is code plus plan-checkbox updates. Skill items to pin: ① take the plan file as an argument and execute in order, ② mark - [ ] to - [x] on each step to reflect real-time progress in the file, ③ do not stop until all steps are done — no mid-check prompts (without this, the AI asks "continue?" every step), ④ ban any, unknown, interface{}, void*, ⑤ run the language's type checker after every file edit. Note: items ④ and ⑤ are instructed in the skill text, but the enforcement lives in the hook. Section 9 covers this separation directly.

7. Skill 5 — Debug: What the AI Is Best at Is Reading Logs

Here I flip the tone. The last five sections weighted toward "how the AI gets things wrong." Debug is the stage where I get the most out of the AI. Logs are fact data. Compiler error messages, runtime stack traces, panic dumps over serial. These are inputs the AI cannot fabricate — more precisely, inputs it has no need to fabricate — and the room for hallucination shrinks dramatically. At Debug, the share of AI suggestions I accept is higher than at any other stage.

Three tips are enough. First, pass the error message with the related source. Hand over only the error and the AI guesses. Include the 30 lines around the error line, the definitions of functions on the call stack, and the relevant headers, and the guesswork turns into analysis. Second, have the AI write a minimal reproduction. Tell it "write the smallest program that reproduces this bug" and in the process the AI has to make the bug's assumptions explicit. That explicitness often surfaces the root cause. Third, structured log formatting. Emit serial logs as JSON or at least with consistent tags ([BLE], [OTA], [MCUBOOT]) and the AI's pattern matching gets much stronger. The reason I enforced tag formats in the HIL CI story was not for the human reader — it was to make the logs easier for the AI to read.

Here this essay hits its paradox. Do not freeze Debug itself as a skill. I embedded Research, Plan, Fact-Check, Implement, and Review as skill files but deliberately left Debug out. One reason — Debug is inherently reactive. Every incident has a different error class, different related files, different reproduction conditions. Packing that into a single SKILL.md kills flexibility. The AI ends up following a "generalized debug procedure" and misses the specific oddity of this incident. Prompt patterns are better cooked on the spot, per situation.

I did pull input-bundle normalization into its own skill. I call it error-bundle, and it does exactly one thing — packs the error log, the related source files, and the reproduction conditions into a fixed shape and attaches them to the AI's context. The core work (hypothesis, root-cause tracing) stays in ad-hoc prompts; only the repetitive input prep is skillified.
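A sketch of that one repetitive piece; the output file name and section layout are my own assumptions, not the actual error-bundle skill:

```shell
# error-bundle sketch: pack the error log tail, the related sources, and a
# repro slot into one fixed-shape context file. Names are illustrative.
error_bundle() {
  log="$1"
  shift
  {
    echo "# Error bundle"
    echo "## Error log (last 100 lines)"
    tail -n 100 "$log"
    for src in "$@"; do
      echo "## Source: $src"
      cat "$src"
    done
    echo "## Repro conditions"
    echo "(fill in: board, SDK/toolchain versions, exact steps)"
  } > error-bundle.md
}
```

Usage would look like `error_bundle serial.log drivers/spi.c boards/my_board.dts`; the hypothesis work that follows stays ad hoc.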

This boundary surfaces a principle running through this entire essay — reactive work and productive work have different skillification returns. Productive work (Research, Plan, Implement, Review, Fact-Check) has fixed I/O, so freezing pays. Reactive work (Debug) varies per incident, so freezing the core hurts. Debug is the clearest illustration of that boundary. Skillifying is not always the answer; it is the answer only when there is a repeating shape.

8. Skill 6 — Review: Have an AI Review AI Code — Don't Trust It

In an HIL CI pipeline I built a double loop where one AI session reviews code written by another. The bugs that loop has caught fall into three types — missing edge cases, style inconsistencies, minor type weaknesses. The bugs it misses all cluster into one category — "code that runs but is bad." Architectural wrong turns, performance bottlenecks, race conditions, security holes. As Cursor put it in one line: "AI-generated code can look right while being subtly wrong." Reviewer AIs share the same weakness, so the subtle wrongness the author missed is the same subtle wrongness the reviewer misses.

So my Review skill is designed around surfacing the areas the AI cannot catch. Three rules.

First, pin a Staff Engineer persona. Personas are the oldest prompt trick in the book, but rewriting them on every call is wasteful. Put it at the top of the skill once and every review gets the "senior lens" automatically. My current persona reads: "You are a 10-year Staff Engineer. Sort production failure scenarios for this code by cost. Address style last." That last line matters — without it the AI starts with the easy wins and runs out of steam by the time it reaches real structural problems.

Second, force an independent session. Same principle as Fact-Check, and it matters even more here. When the author's session reviews its own code, confirmation bias doubles — the AI remembers the logic it just wrote and verifies only inside that logic. The Claude Code docs capture it in one line: "A fresh context improves code review since Claude won't be biased toward code it just wrote." Implementation-wise, split it into a subagent and grant only Read, Grep, Glob. A reviewer that cannot edit code is the only real reviewer.

Third, auto-flag security paths for manual review. When the reviewer generates the report, if any modified file path touches auth/, crypto/, sign/, or token/, the report inserts a top banner: "⚠ SECURITY PATH — AI review is not sufficient." That banner requires a human to read and remove it manually before the pipeline moves on. My time using these agents confirmed that security bugs fall largely outside the set of bugs AI review reliably catches, and that lesson only sticks once I pin it into the skill file. Relying on memory to be careful manually every time fails eventually.

The rule that reviewers do not modify the source is identical to Fact-Check. The review report is a separate file, and only that file is writable.

If you freeze it as a file — input is a diff or list of file paths, output is a review report. Four items to pin: ① Staff Engineer persona at the top of the skill, ② force an independent session (allowed-tools: Read, Grep, Glob + subagent), ③ auto-flag for security paths, ④ no source modification (write permission limited to the report file). Along with Fact-Check, this is the skill with the highest file-freezing return — its I/O shape is fixed, so reuse pays immediately.
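Frozen as a sketch, with the same caveat as the Research scaffold (frontmatter field names from the Claude Code skills docs; the skill name, report path, and wording are mine):

```shell
# Hypothetical scaffold for the Review skill file. Frontmatter field names
# follow the Claude Code skills docs; the rest is illustrative.
mkdir -p .claude/skills/code-review
cat > .claude/skills/code-review/SKILL.md <<'EOF'
---
name: code-review
description: Review a diff as a Staff Engineer and write a report
allowed-tools: Read, Grep, Glob, Write(review-report.md)
---
You are a 10-year Staff Engineer. Sort production failure scenarios for this
code by cost. Address style last.
If any changed path touches auth/, crypto/, sign/, or token/, put the banner
"⚠ SECURITY PATH — AI review is not sufficient" at the top of the report.
Never modify the files under review; write only review-report.md.
EOF
```

With no Edit or Write permission beyond the report file, the reviewer physically cannot "fix it while reviewing."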

9. Three Traps You Hit When Building Many Skills

I have designed the six skills above and covered each one's individual requirements in the "If you freeze it as a file" blocks. But as you stack skills one after another, problems emerge that belong not to any single skill but to the entire skill repository. These three are cross-cutting warnings that do not fit inside any one section above. I stepped on each of them once while using AI coding agents.

1. Naming collisions. Generic names like planner, research, review collide the moment a second pipeline exists. My repo currently has four planners — planner (essay planning), reddit-post-planner, x-thread-planner, and impl-planner (code implementation planning). It started as a single planner. Then I built a Reddit-post pipeline and named that one planner too, and two identically named skills started colliding across contexts. Only after I added domain prefixes (impl-, reddit-post-, x-thread-) did the confusion stop. Prefix by domain from day one. engineering-essay-planner looks excessive when there is only one planner, but the day a second planner appears always comes.

2. Rule-copy debt. Copy-pasting rules like "no any," "persistent type checking," or "do not modify source documents" into multiple SKILL.md files means the day comes when you fix one copy and forget the rest. I once had "do not modify source" copy-pasted into Research, Fact-Check, and Review, and changing one policy meant editing three files. Global rules belong in .claude/rules/ or CLAUDE.md exactly once, and each skill references them. The Claude Code docs flag this sharply: "Bloated CLAUDE.md files cause Claude to ignore your actual instructions." Copy-pasted rules grow length, and growth dilutes the weight of every instruction. Cursor's guidance points the same way: "Add rules only when you notice the agent making the same mistake repeatedly." Rules are added when they earn it, and added rules live in one place.

3. Hook alignment. The third is the subtlest. If a skill is the tool for telling the AI "what to do," a hook is the tool for enforcing "what not to allow." Blur that boundary and both grow weak. My Implement skill text says "run the type checker after every file edit," but that is persuasion. The actual enforcement lives in the PostToolUse hook auto-typecheck.sh, which runs tsc --noEmit whenever a file is edited and, if any error appears, blocks the tool call with exit 2. There is no way for the AI to bypass that block. The Claude Code docs put it in one line: "Unlike CLAUDE.md instructions which are advisory, hooks are deterministic and guarantee the action happens." Instructions are advice; hooks are contracts. The "do not modify the source" rules in Fact-Check and Review work the same way — instruct it in the skill text, but the actual block comes from a PreToolUse hook like protect-docs.sh. Try to make skills and hooks carry the same responsibility and one of them will betray you.
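The protect-docs side can be sketched the same way (again with the file path as an argument rather than the JSON stdin payload a real PreToolUse hook receives):

```shell
# protect-docs sketch: a PreToolUse-style check refusing writes to
# source-of-truth documents. Path as $1 here; real hooks get JSON on stdin.
protect_docs() {
  case "$1" in
    *findings.md|*plan.md)
      echo "blocked: $1 is a protected source document" >&2
      return 2   # rejects the tool call before it runs
      ;;
  esac
  return 0
}
```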

Compressed to one line: domain-prefixed naming / global rules in one place / delegate determinism to hooks. I learned each of these the hard way, one incident each. You just read about them, so maybe you can skip one.

10. The Compounding Six Skills Build

Each of the six pays off on its own. Running the Research skill alone speeds up research; running Fact-Check alone filters out at least one wrong premise. But the real return of this structure comes from the relationships between the skills. A clean Fact-Check gives Plan a clean input; a well-written Plan cuts hallucinations at Implement; a tight Implement cuts load at Debug; a fast Debug lets Review concentrate on structural issues. Improve any one stage by 1.2× and that 1.2 multiplies into the next, until the original task feels two to three times faster. I call this AI compounding. It is a return you cannot get from single-prompt improvements.

The reason every stage needs a differently shaped gate is that every stage has a different failure mode. Research fails on source credibility, Plan fails on implicit premises, Implement fails on missing context, Debug fails on the opposite — context overflow — and Review fails on author bias. They all share the name "gate," but the machines inside are entirely different. One kind of device cannot block five kinds of failure.

Next quarter I want to try two directions. One is automatic skill chaining — Research finishes, Fact-Check fires automatically, and Plan fires only on pass. Today I type /fact-check by hand. The other is expanding hook-based gate automation — today only Implement has a hook attached, and I can see room for deterministic check hooks at Fact-Check and Review. If both land, gates will no longer be something I guard — they become something the system guards. When that moment arrives I'll have another reason to write a follow-up.

Top comments (1)

PEACEBINFLOW

The distinction between "advisory" and "contractual" that you draw with skills versus hooks is the part I keep thinking about, because I've watched myself blur that boundary repeatedly without noticing.

There's a comfort in writing a good SKILL.md. It feels like you've solved the problem. The prose is clear, the instructions are specific, the edge cases are covered. And then the agent ignores it on a Tuesday and you're surprised, even though you shouldn't be—you wrote advice, not a contract. The surprise itself is the sign that you'd mentally upgraded the skill text to something it isn't.

What I think you're describing with the hook layer is basically taking the things you actually cannot tolerate being violated and moving them out of the language model's decision space entirely. The type checker runs because the hook fires, not because the agent remembered. The source files are protected because the hook blocks writes, not because the skill text asked nicely. That's a different category of safety.

It makes me wonder about the stuff that sits in the gap. There are things that aren't quite "crash the build" level of non-negotiable but also aren't just stylistic preferences. The Kconfig symbol verification you described—that's deterministic enough to be a hook, but you've got it as a Fact-Check step the agent runs voluntarily. At some volume, does that migrate into a hook too? Or is there a class of verification that's better left in the agent's hands because it requires interpretation, even if that means occasionally missing?

I guess what I'm asking is: how do you decide what graduates from skill text to hook? Is it purely "can I express this as exit 0/exit 1," or is there a pain threshold where a repeated failure crosses some invisible line?