<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Errata Hunter</title>
    <description>The latest articles on DEV Community by Errata Hunter (@erratahunter).</description>
    <link>https://dev.to/erratahunter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3797510%2F78826cc2-a514-4c4b-8e2f-a575c16e80f4.png</url>
      <title>DEV Community: Errata Hunter</title>
      <link>https://dev.to/erratahunter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/erratahunter"/>
    <language>en</language>
    <item>
      <title>How to Actually Use AI Coding Agents — 6 Skill-Specific Tips</title>
      <dc:creator>Errata Hunter</dc:creator>
      <pubDate>Thu, 23 Apr 2026 21:35:11 +0000</pubDate>
      <link>https://dev.to/erratahunter/how-to-actually-use-ai-coding-agents-6-skill-specific-tips-4ohe</link>
      <guid>https://dev.to/erratahunter/how-to-actually-use-ai-coding-agents-6-skill-specific-tips-4ohe</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The real problem with AI coding is not model quality but &lt;strong&gt;which stage you hand what to&lt;/strong&gt; — a nonexistent &lt;code&gt;CONFIG_SPI_NRFX_SPIM3&lt;/code&gt; in Zephyr passing the build and then bricking boot is the proof.&lt;/li&gt;
&lt;li&gt;The fix is not a loop but &lt;strong&gt;gates&lt;/strong&gt; — split the work into Research → Fact-Check → Plan → Fact-Check → Implement → Debug → Review, and run Fact-Check twice in &lt;strong&gt;independent sessions&lt;/strong&gt;, right after research and right after planning.&lt;/li&gt;
&lt;li&gt;Separate each gate into skill (instruction) and hook (contract) — put deterministic checks like banning &lt;code&gt;any&lt;/code&gt;/&lt;code&gt;void*&lt;/code&gt; into hooks such as &lt;code&gt;auto-typecheck.sh&lt;/code&gt;, and keep personas and prompt patterns in SKILL.md. The whole pipeline then compounds.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. Where Vibe Coding Breaks
&lt;/h2&gt;

&lt;p&gt;Last year I was adding one more BLE sensor node to a Zephyr-based nRF52 firmware. I threw a one-liner at Claude Code — "enable the SPI driver" — and a single clean line landed in &lt;code&gt;prj.conf&lt;/code&gt;: &lt;code&gt;CONFIG_SPI_NRFX_SPIM3=y&lt;/code&gt;. The build passed. The binary flashed. The board would not boot. Thirty minutes of digging later, it hit me — &lt;strong&gt;that symbol does not exist anywhere in Zephyr&lt;/strong&gt;. The correct answer on this chip family is &lt;code&gt;CONFIG_SPI_NRFX_SPIM&lt;/code&gt; plus a Devicetree node activation. The symbol the AI had synthesized was silently dropped by the Kconfig parser with a single "unknown symbol, ignoring" warning, buried somewhere in 800 lines of build log.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://simonwillison.net/2025/Mar/2/hallucinations-in-code/" rel="noopener noreferrer"&gt;Simon Willison&lt;/a&gt; wrote in March 2025 that "hallucinations in code are the least dangerous form of LLM mistakes." The reasoning is clean — you run the code and the error yells at you. Call a method that doesn't exist and the stack trace shouts, then paste it back into the agent. Done. Willison made this as a general claim, not restricted to a language or domain. It holds on the web. It holds in a Python REPL. It did not hold in my firmware. &lt;strong&gt;The moment the assumption "run it and the error surfaces" collapses, the entire run-and-detect feedback loop loses its meaning.&lt;/strong&gt; The compile passed, the build passed, the binary flashed, and the board quietly turned into a brick. Willison's optimism did not protect me there.&lt;/p&gt;

&lt;p&gt;I did not want to file this under "Claude Code isn't smart enough yet." I tried the same prompt against GPT-5 and Gemini and got similar results. The problem was not &lt;strong&gt;AI quality&lt;/strong&gt; but &lt;strong&gt;where I had placed the AI in my process&lt;/strong&gt;. I was expecting "verified output" from a generation stage. Generation is the stage where hallucination is natural; verification has to happen somewhere else. The empty seat was not the AI's to fill — it was mine.&lt;/p&gt;

&lt;p&gt;Through my time using AI coding agents, I translated that lesson from code into the &lt;strong&gt;shape of a pipeline&lt;/strong&gt;. Pieces I'd built at different moments — a &lt;a href="https://reversetobuild.com/claude-code-embedded-firmware-development/" rel="noopener noreferrer"&gt;Kconfig verification hook&lt;/a&gt;, a &lt;a href="https://reversetobuild.com/ai-firmware-development-workflow/" rel="noopener noreferrer"&gt;gate-based workflow&lt;/a&gt;, an &lt;a href="https://reversetobuild.com/firmware-hil-ci-pipeline/" rel="noopener noreferrer"&gt;HIL CI feedback loop&lt;/a&gt; — only in hindsight did I see they were all answering the same question: &lt;em&gt;at which stage, with what verification, do I hand work to the AI?&lt;/em&gt; This essay is my current answer. Six skills, six gates, and the failures and trade-offs I hit at each seat.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Six Skills as a Frame — Gates, Not a Loop
&lt;/h2&gt;

&lt;p&gt;Many AI coding guides talk about a "loop" — research, plan, implement, review, back to research. Circles are pretty but they did not match my experience. Circles have &lt;strong&gt;nothing to pass through&lt;/strong&gt;. I started seeing the process as &lt;strong&gt;gates&lt;/strong&gt; instead. Each stage has a pass condition — "don't verify this and the next stage gets poisoned" — and the mechanism that holds that condition is its own thing.&lt;/p&gt;

&lt;p&gt;Here is the shape of my pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsjqdyle3w31jhu0r7yq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsjqdyle3w31jhu0r7yq.png" alt="Diagram of an AI coding pipeline connecting six skill gates from Research to Review, with Fact-Check inserted twice — after Research and after Plan" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Research → Fact-Check → Plan → Fact-Check → Implement → Debug → Review. Why Fact-Check appears twice is explained below and in Section 4.&lt;/p&gt;

&lt;p&gt;The odd part here is that &lt;strong&gt;Fact-Check appears twice&lt;/strong&gt; — once right after Research, once right after Plan. I also thought "once is enough" at first. Then I ran into a pattern several times: the research was collected cleanly, but &lt;strong&gt;the assumptions the AI added during planning&lt;/strong&gt; were wrong. Implicit premises like "this library supports that platform" or "this API already exists in v2.4." These were not facts from research but &lt;strong&gt;new claims the planner introduced&lt;/strong&gt;, and they needed a separate teardown.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://addyosmani.com/blog/ai-coding-workflow/" rel="noopener noreferrer"&gt;Addy Osmani's 2026 workflow&lt;/a&gt; has five stages. &lt;a href="https://code.claude.com/docs/en/best-practices" rel="noopener noreferrer"&gt;The Claude Code docs&lt;/a&gt; use four: Explore → Plan → Code. &lt;a href="https://cursor.com/blog/agent-best-practices" rel="noopener noreferrer"&gt;Cursor's best practices&lt;/a&gt; — interestingly — barely use the word "hallucination" in the body and instead say "AI-generated code can look right while being subtly wrong." All three see the same phenomenon. The difference is the number of gates, and &lt;strong&gt;the number of gates scales with the feedback latency of the domain&lt;/strong&gt;. Web and scripting get run-to-error feedback in seconds, so two or three gates are enough. Firmware sits with tens of minutes between compile and boot, and weeks between boot and "no intermittent bug." You need more gates — and not just more, but &lt;strong&gt;different kinds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is why I call them gates rather than loops. A loop is a question of how many times you go around; a gate is a question of &lt;strong&gt;where you put what&lt;/strong&gt;. The latter is system design, the former is operational feel. This essay sits on the system-design side.&lt;/p&gt;

&lt;p&gt;What the AI is good at and bad at also shifts per stage. At Research the AI is an excellent "keyword expander" and a terrible fact checker. At Debug it flips — the AI is an excellent log reader, and here I actually get better results by stepping back. Splitting the work into six skills is how I avoid losing that &lt;strong&gt;role inversion&lt;/strong&gt;. Lump them together and everything collapses into "the AI just isn't great."&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Skill 1 — Research: Excellent Assistant, Terrible Fact Checker
&lt;/h2&gt;

&lt;p&gt;At the research stage the AI is especially good at three things. Keyword expansion (say "BLE 5.3 periodic advertising" and fifteen adjacent terms come out). Comparison tables (current draw, RX sensitivity, BOM cost across chip A/B/C). Document summarization (the two paragraphs I want from a 60-page datasheet). Use those three well and you research two to three times faster than alone.&lt;/p&gt;

&lt;p&gt;The trouble starts right after. The AI &lt;strong&gt;cannot judge source credibility&lt;/strong&gt;, &lt;strong&gt;cannot guarantee recency&lt;/strong&gt;, and &lt;strong&gt;cannot verify domain-specific accuracy&lt;/strong&gt;. I once trusted an AI summary over the datasheet and reversed a register bit order — a bit that had flipped between chip revisions A and B. The AI confidently served the revision-A answer. Half a day gone to debugging. The problem was not that the summary was wrong; the problem was that I had not built a way to check &lt;strong&gt;whether the summary was wrong&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So my Research stage now carries three hard-coded rules. First, &lt;strong&gt;source tagging&lt;/strong&gt;. Every entry in the findings file gets labeled as &lt;code&gt;[official]&lt;/code&gt;, &lt;code&gt;[community]&lt;/code&gt;, or &lt;code&gt;[AI inference]&lt;/code&gt;. That one-word tag decides the "what to doubt first" order at the Fact-Check stage. Second, &lt;strong&gt;concrete query design&lt;/strong&gt;. "Find me BLE OTA docs" is a bad prompt; "official docs, release notes, and &lt;code&gt;ncs-*&lt;/code&gt; tag commit messages for Nordic nRF Connect SDK 2.5's MCUboot swap algorithm" is a good one. The latter forces the AI to &lt;strong&gt;choose where to look&lt;/strong&gt;. Third, &lt;strong&gt;persistence to &lt;code&gt;.md&lt;/code&gt;&lt;/strong&gt;. Research output always accumulates in one &lt;code&gt;findings.md&lt;/code&gt;. Sessions can drop, context can compact — the information survives and flows cleanly into the next stage.&lt;/p&gt;
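&lt;p&gt;As a sketch, tagged entries in &lt;code&gt;findings.md&lt;/code&gt; might look like this (the claims themselves are illustrative placeholders, not verified facts):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## MCUboot swap algorithm (NCS 2.5)
- [official] NCS 2.5 release notes: names the default swap algorithm
- [community] DevZone thread: reports a scratch-partition gotcha on this board
- [AI inference] "swap should survive power loss mid-update" -- verify first
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The tag order is also the doubt order for the Fact-Check stage: &lt;code&gt;[AI inference]&lt;/code&gt; entries get torn down first.&lt;/p&gt;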

&lt;p&gt;&lt;strong&gt;If you freeze it as a file&lt;/strong&gt; — if you decide to embed the Research stage as a skill, pin four things into the skill text. ① An interface that takes "request (topic, scope)" as an argument. ② An output that &lt;strong&gt;appends&lt;/strong&gt; to a single markdown file rather than overwriting. ③ A directive at the top: "analyze deeply and record the details thoroughly" (that single sentence roughly doubles or triples the perceived summary depth). ④ The rule that matters most — &lt;strong&gt;do not modify any file other than the one this skill writes&lt;/strong&gt;. Miss the fourth and the day comes when the AI says "while I was at it, I also fixed &lt;code&gt;main.c&lt;/code&gt;." Unverified edits slip into the research stage, and Fact-Check ends up breaking already-polluted input. Practically, scope tool permissions to a single write path: &lt;code&gt;allowed-tools: Read, Grep, Glob, WebFetch, Write(findings.md)&lt;/code&gt;. The Claude Code skills docs recommend exactly this shape.&lt;/p&gt;
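&lt;p&gt;A minimal sketch of that skill file, assuming Claude Code's &lt;code&gt;SKILL.md&lt;/code&gt; frontmatter shape (the description and body wording are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
name: research
description: Collect findings on a topic and append them to findings.md
allowed-tools: Read, Grep, Glob, WebFetch, Write(findings.md)
---

Analyze deeply and record the details thoroughly.

Input: a request (topic, scope).
Output: append to findings.md -- never overwrite, never touch any other file.
Tag every entry [official], [community], or [AI inference].
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;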

&lt;h2&gt;
  
  
  4. Skill 2 — Fact-Check: Break the Plan Before You Ship It
&lt;/h2&gt;

&lt;p&gt;Fact-Check will be the strangest-sounding section here. Most guides do not place a dedicated verification stage between research and implementation. I place &lt;strong&gt;two&lt;/strong&gt;. One right after research, one right after planning. To explain why, I need to start with a pattern I kept hitting.&lt;/p&gt;

&lt;p&gt;If you have ever told an AI, inside the same session, "find what's wrong with what you just researched," you know how subtly disappointing the result is. The AI leans toward &lt;strong&gt;confirming its own answer&lt;/strong&gt;. It will fix typos and small wording, but the big claims — "this chip supports that feature" — usually survive. I first blamed the model. Then I saw Anthropic's automated red teaming work from 2024 and changed my mind. One model generates attacks and &lt;strong&gt;a different model defends&lt;/strong&gt;. The match does not exist inside a single model, a single session. The industry had already converged on "only verification in an independent session counts." Addy Osmani calls this "secondary AI sessions to critique primary outputs." The Claude Code docs recommend a Writer/Reviewer pattern and explain it in one line: "A fresh context improves code review since Claude won't be biased toward code it just wrote." That bias is exactly the subtle disappointment I kept feeling.&lt;/p&gt;

&lt;p&gt;So my Fact-Check skill does four things.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;it forces an independent session&lt;/strong&gt;. The researching session and the fact-checking session do not share context. The skill takes only a file path as input and reads from there as if seeing the document for the first time. I deliberately build the setup of handing a paper to someone who does not know the answer.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;it hunts "things that don't exist" deterministically&lt;/strong&gt;. In my experience this is the most dangerous family of hallucinations. The principle behind &lt;a href="https://reversetobuild.com/claude-code-embedded-firmware-development/" rel="noopener noreferrer"&gt;the Kconfig-symbol verification hook&lt;/a&gt; is simple — extract every symbol mentioned in the research or plan and &lt;code&gt;grep&lt;/code&gt; the actual Kconfig tree to confirm each one. Present → pass; absent → stamp &lt;code&gt;[TBD: needs fact-check]&lt;/code&gt; and report. What matters is that it is a deterministic file-existence check, not a probabilistic AI judgment. &lt;strong&gt;You wrap nondeterministic generation in deterministic verification&lt;/strong&gt; — that is exactly what the word "gate" means here. Recently an academic version of the same idea appeared. &lt;a href="https://arxiv.org/abs/2509.09970" rel="noopener noreferrer"&gt;arXiv 2509.09970&lt;/a&gt; validates GPT-4-generated FreeRTOS firmware in QEMU, categorizes faults into buffer overflow (CWE-120), race condition (CWE-362), and DoS (CWE-400), runs fuzzing, static analysis, and runtime checks through a three-stage agent loop, and reports a &lt;strong&gt;92.4% Vulnerability Remediation Rate and a 37.3% improvement margin&lt;/strong&gt;. The numbers are tied to that paper's sample, but the design principle — close generation's nondeterminism with verification's determinism — is the same as my hook's.&lt;/p&gt;
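&lt;p&gt;A minimal sketch of that deterministic check, assuming GNU &lt;code&gt;grep&lt;/code&gt; and a Zephyr-style tree where symbols are declared as &lt;code&gt;config NAME&lt;/code&gt; lines (the script name and paths are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/sh
# check-symbols.sh -- deterministic gate: every CONFIG_* symbol the plan
# mentions must actually be declared somewhere in the real Kconfig tree.
# Usage: sh check-symbols.sh plan.md path/to/kconfig/tree
PLAN="${1:-plan.md}"
TREE="${2:-zephyr}"
status=0
for sym in $(grep -oE 'CONFIG_[A-Z0-9_]+' "$PLAN" | sort -u); do
  name="${sym#CONFIG_}"          # declarations drop the CONFIG_ prefix
  if grep -rqE "^(menu)?config +$name\b" "$TREE"; then
    echo "PASS $sym"
  else
    echo "FAIL $sym  [TBD: needs fact-check]"
    status=1
  fi
done
exit $status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A hallucinated symbol like &lt;code&gt;CONFIG_SPI_NRFX_SPIM3&lt;/code&gt; fails this gate before it ever reaches &lt;code&gt;prj.conf&lt;/code&gt;.&lt;/p&gt;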

&lt;p&gt;Third, &lt;strong&gt;it embeds a Red Team prompt&lt;/strong&gt;. The second Fact-Check, right after planning, centers on logical weaknesses rather than cross-referencing official docs. I pin a single line into the skill:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a senior engineer. Find the three weakest links in the plan below,
and for each one describe a concrete failure scenario and the moment it fails.
"It'll probably be fine" counts as one of the three failures.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last line matters more than it looks. Without pinning the Red Team role, the AI wraps up with "a mostly solid plan."&lt;/p&gt;

&lt;p&gt;Fourth, &lt;strong&gt;it never modifies the source&lt;/strong&gt;. Fact-Check output does not touch &lt;code&gt;findings.md&lt;/code&gt; or the plan file; it writes to a &lt;strong&gt;separate report file&lt;/strong&gt;. The moment a verifier edits the verification target, the verifier becomes a new source of contamination. This has to be enforced by structure, not by discipline — in Claude Code, scope &lt;code&gt;allowed-tools&lt;/code&gt; to &lt;code&gt;Read, Grep, WebFetch, Write(fact-check-report.md)&lt;/code&gt;. Write permission opens for the report file only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you freeze it as a file&lt;/strong&gt; — input is the path of the document to verify, output is a single report file. The four items to pin are exactly the four paragraphs above. One addition: Fact-Check is a skill that &lt;strong&gt;should have no side effects&lt;/strong&gt;, so setting &lt;code&gt;disable-model-invocation: true&lt;/code&gt; and only running it on explicit invocation is the safer default. The Claude Code skill system exposes that flag for exactly this use case.&lt;/p&gt;
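&lt;p&gt;A minimal sketch of that skill file, again assuming Claude Code's &lt;code&gt;SKILL.md&lt;/code&gt; frontmatter shape (the wording is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
name: fact-check
description: Verify a findings or plan file in a fresh session. Report only.
disable-model-invocation: true
allowed-tools: Read, Grep, WebFetch, Write(fact-check-report.md)
---

Input: path of the document to verify. Read it cold, as a first-time reader.
Extract every symbol, API, and version claim and check each one against the
actual tree or official docs. Absent symbols get [TBD: needs fact-check].
Write results to fact-check-report.md only. Never edit the source document.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;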

&lt;h2&gt;
  
  
  5. Skill 3 — Plan: Draft It Twice, Break It Once
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://code.claude.com/docs/en/best-practices" rel="noopener noreferrer"&gt;Plan Mode&lt;/a&gt; is not universal. The Claude Code docs admit this plainly: "If you could describe the diff in one sentence, skip the plan." Turning on Plan Mode for a one-sentence diff costs more than it returns. My Plan stage does &lt;strong&gt;not&lt;/strong&gt; run every time — it runs when I feel "my mental model and the AI's mental model might be misaligned before I touch code." The heuristic, from experience, is roughly: if more than two files are affected, or if I am touching a system I know less well, I always run Plan.&lt;/p&gt;

&lt;p&gt;When Plan runs, I draft twice. The first draft is the AI's; the second is the AI revising based on my annotations. At least two rounds. This is closest to Osmani's "waterfall in 15 minutes" analogy, and what matters is that the two rounds serve different goals. Round one &lt;strong&gt;enumerates the full list of steps&lt;/strong&gt;. Round two &lt;strong&gt;attaches the trade-offs round one missed&lt;/strong&gt;. It is not doing the same thing twice.&lt;/p&gt;

&lt;p&gt;Round two carries one extra prompt — "where will this plan fail first?" It is the same family as the Red Team in Fact-Check #2, but the timing differs. Throwing one self-destructive question before the plan hardens turns the answer into a &lt;strong&gt;"risks" section&lt;/strong&gt; that then acts as a warning light throughout implementation. Write the plan with future-you as its reader, and its usefulness goes up.&lt;/p&gt;

&lt;p&gt;I pin four required elements to every plan. ① &lt;strong&gt;Approach detail&lt;/strong&gt; — why this order, why not another. ② &lt;strong&gt;Before/after code snippets&lt;/strong&gt; — not "refactor this part," but the actual shape of the change. ③ &lt;strong&gt;Exact file paths&lt;/strong&gt; — no "modify the relevant files." ④ &lt;strong&gt;Explicit trade-offs&lt;/strong&gt; — chosen approach, alternatives, reasons for rejection. Miss these four and the plan does not reach the level of "Claude can implement this right now." Most of the mid-implementation "what should I do here?" interruptions come from vague plans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you freeze it as a file&lt;/strong&gt; — input is the path of a research file (or &lt;code&gt;findings.md&lt;/code&gt;), output is a single plan document with checkboxes (&lt;code&gt;- [ ]&lt;/code&gt;). The five things to pin into the skill: ① read the research file first (without this the AI plans from memory), ② justify the chosen order, ③ before/after code snippets, ④ exact file paths, ⑤ trade-offs. For items four and five, leaving empty slots in the skill's example template builds the habit of filling them in.&lt;/p&gt;
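&lt;p&gt;An illustrative slice of such a template, empty slots included (the paths and symbols here are examples, not prescriptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Steps
- [ ] 1. Add the sensor node to boards/my_board.overlay (exact path, never "the overlay")
- [ ] 2. Set CONFIG_SPI_NRFX_SPIM=y in prj.conf

## Before / after
prj.conf before: (paste current lines here)
prj.conf after:  (paste changed lines here)

## Trade-offs
Chosen: (approach, and why this order)
Rejected: (alternatives, and the reason each lost)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;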

&lt;h2&gt;
  
  
  6. Skill 4 — Implement: Structured Context Cuts Hallucinations
&lt;/h2&gt;

&lt;p&gt;The one thing I learned at Implement is the equation &lt;strong&gt;context = answer key&lt;/strong&gt;. The AI's probability of producing the right answer is far more sensitive to &lt;strong&gt;the quality of the context you attach&lt;/strong&gt; than to the "instructions" in your prompt. When I &lt;a href="https://reversetobuild.com/claude-code-embedded-firmware-development/" rel="noopener noreferrer"&gt;had the AI write a Devicetree overlay&lt;/a&gt;, I bundled the board file, the target node's DTS, and the binding YAML together via &lt;code&gt;@&lt;/code&gt; references — and a task that had failed three times passed on the first try. The write-up at &lt;a href="https://reversetobuild.com/claude-code-embedded-firmware-development/" rel="noopener noreferrer"&gt;reversetobuild.com&lt;/a&gt; recommends the exact same pattern: inject &lt;code&gt;@boards/arm/nrf52840dk_nrf52840.dts&lt;/code&gt;, &lt;code&gt;@zephyr/dts/arm/nordic/nrf52840.dtsi&lt;/code&gt;, and &lt;code&gt;@zephyr/dts/bindings/spi/spi-device.yaml&lt;/code&gt; together. That the same bundle keeps being necessary tells me this is not a personal trick but a &lt;strong&gt;structural requirement of the domain&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The counterpart to structured context is &lt;strong&gt;incremental implementation&lt;/strong&gt;. Generating 200 lines at once is almost always worse than generating 20 lines ten times. The reason is simple — every 20 lines the build runs, the type checker runs, the linter runs. Errors surface immediately, and the AI generates the next 20 lines in a &lt;strong&gt;slightly different context&lt;/strong&gt;. Generate 200 at once and a single error pulls the whole block down, and the AI loses context about where to start fixing. "Short and frequent" is not just a commit principle; it is a generation principle.&lt;/p&gt;

&lt;p&gt;And the most practical rule — &lt;strong&gt;no type escapes&lt;/strong&gt;. TypeScript's &lt;code&gt;any&lt;/code&gt; and &lt;code&gt;unknown&lt;/code&gt;, Python's &lt;code&gt;Any&lt;/code&gt;, Go's &lt;code&gt;interface{}&lt;/code&gt;, C/C++'s &lt;code&gt;void*&lt;/code&gt;. These are the first exits the AI takes when stuck. When the AI plasters over a spot with &lt;code&gt;any&lt;/code&gt; "just to make it run," that spot is exactly where the runtime error lands a few weeks later. My Implement skill text bans these explicitly, but &lt;strong&gt;the enforcement is a hook, not a skill&lt;/strong&gt;. A PostToolUse hook called &lt;code&gt;auto-typecheck.sh&lt;/code&gt; runs &lt;code&gt;tsc --noEmit&lt;/code&gt; or &lt;code&gt;mypy&lt;/code&gt; right after a file edit and, on any &lt;code&gt;any&lt;/code&gt; regression or type error, blocks the tool call itself with &lt;code&gt;exit 2&lt;/code&gt;. Skill text is persuasion; the hook is the contract. Do not mix the two.&lt;/p&gt;
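&lt;p&gt;A minimal sketch of that hook, assuming Claude Code's PostToolUse protocol (JSON event on stdin, &lt;code&gt;exit 2&lt;/code&gt; blocks the edit and feeds stderr back to the agent) and &lt;code&gt;jq&lt;/code&gt; on the PATH; the &lt;code&gt;any&lt;/code&gt; pattern is a rough heuristic, not a parser:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/sh
# auto-typecheck.sh -- PostToolUse hook sketch: block edits that introduce
# a type escape or fail the type checker.
file=$(jq -r '.tool_input.file_path // empty')
[ -n "$file" ] || exit 0
case "$file" in
  *.ts|*.tsx)
    if grep -nE ':[[:space:]]*any\b' "$file" 1&gt;&amp;2; then
      echo "type escape: explicit any in $file" 1&gt;&amp;2
      exit 2
    fi
    tsc --noEmit 1&gt;&amp;2 || exit 2
    ;;
  *.py)
    mypy "$file" 1&gt;&amp;2 || exit 2
    ;;
esac
exit 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;any&lt;/code&gt; grep fires before the type checker runs, so the cheapest rejection comes first.&lt;/p&gt;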

&lt;p&gt;Security code is the exception. Cryptography, authentication, signing, key management — I do not hand these to the AI. You might ask "can't you just review it?" I'll come back to that in the Review section. The short version: &lt;strong&gt;the distribution of bugs an AI reviewer misses overlaps with the distribution of security bugs&lt;/strong&gt;. So I remove them from the generation stage entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you freeze it as a file&lt;/strong&gt; — input is a plan file path, output is code plus plan-checkbox updates. Skill items to pin: ① take the plan file as an argument and execute in order, ② mark &lt;code&gt;- [ ]&lt;/code&gt; to &lt;code&gt;- [x]&lt;/code&gt; on each step to reflect real-time progress in the file, ③ &lt;strong&gt;do not stop until all steps are done&lt;/strong&gt; — no mid-check prompts (without this, the AI asks "continue?" every step), ④ ban &lt;code&gt;any&lt;/code&gt;, &lt;code&gt;unknown&lt;/code&gt;, &lt;code&gt;interface{}&lt;/code&gt;, &lt;code&gt;void*&lt;/code&gt;, ⑤ run the language's type checker after every file edit. Note: items ④ and ⑤ are &lt;strong&gt;instructed&lt;/strong&gt; in the skill text, but the enforcement lives in the hook. Section 9 covers this separation directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Skill 5 — Debug: What the AI Is Best at Is Reading Logs
&lt;/h2&gt;

&lt;p&gt;Here I flip the tone. The last five sections weighted toward "how the AI gets things wrong." Debug is &lt;strong&gt;the stage where I get the most out of the AI&lt;/strong&gt;. Logs are &lt;strong&gt;fact data&lt;/strong&gt;. Compiler error messages, runtime stack traces, panic dumps over serial. These are inputs the AI cannot fabricate — more precisely, inputs it has no need to fabricate — and the room for hallucination shrinks dramatically. At Debug, the share of AI suggestions I accept is higher than at any other stage.&lt;/p&gt;

&lt;p&gt;Three tips are enough. First, &lt;strong&gt;pass the error message with the related source&lt;/strong&gt;. Hand over only the error and the AI guesses. Include the 30 lines around the error line, the definitions of functions on the call stack, and the relevant headers, and the guesswork turns into analysis. Second, &lt;strong&gt;have the AI write a minimal reproduction&lt;/strong&gt;. Tell it "write the smallest program that reproduces this bug" and in the process the AI has to make the bug's assumptions explicit. That explicitness often surfaces the root cause. Third, &lt;strong&gt;structured log formatting&lt;/strong&gt;. Emit serial logs as JSON or at least with consistent tags (&lt;code&gt;[BLE]&lt;/code&gt;, &lt;code&gt;[OTA]&lt;/code&gt;, &lt;code&gt;[MCUBOOT]&lt;/code&gt;) and the AI's pattern matching gets much stronger. The reason I enforced tag formats in &lt;a href="https://reversetobuild.com/firmware-hil-ci-pipeline/" rel="noopener noreferrer"&gt;the HIL CI story&lt;/a&gt; was not for the human reader — it was to make the logs easier for the AI to read.&lt;/p&gt;

&lt;p&gt;Here this essay hits its paradox. &lt;strong&gt;Do not freeze Debug itself as a skill.&lt;/strong&gt; I embedded Research, Plan, Fact-Check, Implement, and Review as skill files but deliberately left Debug out. One reason — Debug is inherently &lt;strong&gt;reactive&lt;/strong&gt;. Every incident has a different error class, different related files, different reproduction conditions. Packing that into a single &lt;code&gt;SKILL.md&lt;/code&gt; kills flexibility. The AI ends up following a "generalized debug procedure" and misses the specific oddity of this incident. Prompt patterns are better cooked on the spot, per situation.&lt;/p&gt;

&lt;p&gt;I did pull &lt;strong&gt;input-bundle normalization&lt;/strong&gt; into its own skill. I call it &lt;code&gt;error-bundle&lt;/code&gt;, and it does exactly one thing — packs the error log, the related source files, and the reproduction conditions into a fixed shape and attaches them to the AI's context. The core work (hypothesis, root-cause tracing) stays in ad-hoc prompts; only the repetitive input prep is skillified.&lt;/p&gt;

&lt;p&gt;This boundary surfaces a principle running through this entire essay — &lt;strong&gt;reactive work and productive work have different skillification returns&lt;/strong&gt;. Productive work (Research, Plan, Implement, Review, Fact-Check) has fixed I/O, so freezing pays. Reactive work (Debug) varies per incident, so freezing the core hurts. Debug is the clearest illustration of that boundary. Skillifying is not always the answer; it is the answer only when there is a repeating &lt;strong&gt;shape&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Skill 6 — Review: Have an AI Review AI Code — Don't Trust It
&lt;/h2&gt;

&lt;p&gt;In an &lt;a href="https://reversetobuild.com/firmware-hil-ci-pipeline/" rel="noopener noreferrer"&gt;HIL CI pipeline&lt;/a&gt; I built a double loop where one AI session reviews code written by another. The bugs that loop has caught fall into three types — missing edge cases, style inconsistencies, minor type weaknesses. The bugs it misses all cluster into one category — &lt;strong&gt;"code that runs but is bad."&lt;/strong&gt; Architectural wrong turns, performance bottlenecks, race conditions, security holes. As Cursor put it in one line: "AI-generated code can look right while being subtly wrong." Reviewer AIs share the same weakness, so the subtle wrongness the author missed is the same subtle wrongness the reviewer misses.&lt;/p&gt;

&lt;p&gt;So my Review skill is designed around &lt;strong&gt;surfacing the areas the AI cannot catch&lt;/strong&gt;. Three rules.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;pin a Staff Engineer persona&lt;/strong&gt;. Personas are the oldest prompt trick in the book, but rewriting them on every call is wasteful. Put it at the top of the skill once and every review gets the "senior lens" automatically. My current persona reads: "You are a 10-year Staff Engineer. Sort production failure scenarios for this code by cost. Address style last." That last line matters — without it the AI starts with the easy wins and runs out of steam by the time it reaches real structural problems.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;force an independent session&lt;/strong&gt;. Same principle as Fact-Check, and it matters even more here. When the author's session reviews its own code, confirmation bias doubles — the AI remembers the logic it just wrote and verifies only inside that logic. The Claude Code docs capture it in one line: "A fresh context improves code review since Claude won't be biased toward code it just wrote." Implementation-wise, split it into a subagent and grant only &lt;code&gt;Read, Grep, Glob&lt;/code&gt;. A reviewer that cannot edit code is the only real reviewer.&lt;/p&gt;

&lt;p&gt;Third, &lt;strong&gt;auto-flag security paths for manual review&lt;/strong&gt;. When the reviewer generates the report, if any modified file path touches &lt;code&gt;auth/&lt;/code&gt;, &lt;code&gt;crypto/&lt;/code&gt;, &lt;code&gt;sign/&lt;/code&gt;, or &lt;code&gt;token/&lt;/code&gt;, the report inserts a top banner: "⚠ SECURITY PATH — AI review is not sufficient." That banner requires a human to read and remove it manually before the pipeline moves on. My time using these agents confirmed that the security-bug distribution does not overlap with the bugs AI review is good at catching, and that confirmation only becomes permanent when I pin it into the skill file. Relying on memory to be careful manually every time fails eventually.&lt;/p&gt;
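&lt;p&gt;The path rule itself can be deterministic. A small sketch, assuming the changed-file list arrives on stdin (the script and report names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/sh
# flag-security-paths.sh -- reads changed file paths on stdin; if any path
# sits under a sensitive directory, prepend a banner to the review report
# that a human must delete before the pipeline moves on.
REPORT="${1:-review-report.md}"
if grep -qE '(^|/)(auth|crypto|sign|token)/' -; then
  tmp=$(mktemp)
  { echo "⚠ SECURITY PATH -- AI review is not sufficient."; echo; cat "$REPORT"; } &gt; "$tmp"
  mv "$tmp" "$REPORT"
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Feed it from the review diff, e.g. &lt;code&gt;git diff --name-only main | sh flag-security-paths.sh review-report.md&lt;/code&gt;.&lt;/p&gt;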

&lt;p&gt;The rule that reviewers do not modify the source is identical to Fact-Check. The review report is a separate file, and only that file is writable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you freeze it as a file&lt;/strong&gt; — input is a diff or list of file paths, output is a review report. Four items to pin: ① Staff Engineer persona at the top of the skill, ② force an independent session (&lt;code&gt;allowed-tools: Read, Grep, Glob&lt;/code&gt; + subagent), ③ auto-flag for security paths, ④ no source modification (write permission limited to the report file). Along with Fact-Check, this is the skill with &lt;strong&gt;the highest file-freezing return&lt;/strong&gt; — its I/O shape is fixed, so reuse pays immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Three Traps You Hit When Building Many Skills
&lt;/h2&gt;

&lt;p&gt;I have designed the six skills above and covered each one's individual requirements in the "If you freeze it as a file" blocks. But as you stack skills one after another, problems emerge that belong not to any single skill but to the &lt;strong&gt;entire skill repository&lt;/strong&gt;. These three are cross-cutting warnings that do not fit inside any one section above. I stepped on each of them once while using AI coding agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Naming collisions.&lt;/strong&gt; Generic names like &lt;code&gt;planner&lt;/code&gt;, &lt;code&gt;research&lt;/code&gt;, &lt;code&gt;review&lt;/code&gt; collide the moment a second pipeline exists. My repo currently has four planners — &lt;code&gt;planner&lt;/code&gt; (essay planning), &lt;code&gt;reddit-post-planner&lt;/code&gt;, &lt;code&gt;x-thread-planner&lt;/code&gt;, and &lt;code&gt;impl-planner&lt;/code&gt; (code implementation planning). It started as a single &lt;code&gt;planner&lt;/code&gt;. Then I built a Reddit-post pipeline and named that one &lt;code&gt;planner&lt;/code&gt; too, and two identically named skills started colliding across contexts. Only after I added domain prefixes (&lt;code&gt;impl-&lt;/code&gt;, &lt;code&gt;reddit-post-&lt;/code&gt;, &lt;code&gt;x-thread-&lt;/code&gt;) did the confusion stop. &lt;strong&gt;Prefix by domain from day one.&lt;/strong&gt; &lt;code&gt;engineering-essay-planner&lt;/code&gt; looks excessive when there is only one planner, but the day a second planner appears always comes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Rule-copy debt.&lt;/strong&gt; Copy-pasting rules like "no &lt;code&gt;any&lt;/code&gt;," "persistent type checking," or "do not modify source documents" into multiple &lt;code&gt;SKILL.md&lt;/code&gt; files means the day comes when you fix one copy and forget the rest. I once had "do not modify source" copy-pasted into Research, Fact-Check, and Review, and changing one policy meant editing three files. Global rules belong in &lt;code&gt;.claude/rules/&lt;/code&gt; or &lt;code&gt;CLAUDE.md&lt;/code&gt; &lt;strong&gt;exactly once&lt;/strong&gt;, and each skill references them. The Claude Code docs flag this sharply: "Bloated CLAUDE.md files cause Claude to ignore your actual instructions." Copy-pasted rules grow length, and growth &lt;strong&gt;dilutes the weight of every instruction&lt;/strong&gt;. Cursor's guidance points the same way: "Add rules only when you notice the agent making the same mistake repeatedly." Rules are added when they earn it, and added rules live in one place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Hook alignment.&lt;/strong&gt; The third is the subtlest. If a skill is the tool for telling the AI "what to do," a hook is the tool for enforcing "what not to allow." Blur that boundary and both grow weak. My Implement skill text says "run the type checker after every file edit," but that is &lt;strong&gt;persuasion&lt;/strong&gt;. The actual &lt;strong&gt;enforcement&lt;/strong&gt; lives in the PostToolUse hook &lt;code&gt;auto-typecheck.sh&lt;/code&gt;, which runs &lt;code&gt;tsc --noEmit&lt;/code&gt; whenever a file is edited and, if any error appears, blocks the tool call with &lt;code&gt;exit 2&lt;/code&gt;. There is no way for the AI to bypass that block. The Claude Code docs put it in one line: "Unlike CLAUDE.md instructions which are advisory, hooks are deterministic and guarantee the action happens." &lt;strong&gt;Instructions are advice; hooks are contracts.&lt;/strong&gt; The "do not modify the source" rules in Fact-Check and Review work the same way — instruct it in the skill text, but the actual block comes from a PreToolUse hook like &lt;code&gt;protect-docs.sh&lt;/code&gt;. Try to make skills and hooks carry the same responsibility and one of them will betray you.&lt;/p&gt;
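&lt;p&gt;For reference, the skeleton of such a hook is small. A simplified sketch in the style of &lt;code&gt;auto-typecheck.sh&lt;/code&gt; (the helper function is my illustration; the real checker invocation is commented out so the sketch runs anywhere):&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch of a PostToolUse hook in the style of auto-typecheck.sh.
# Claude Code hook contract: exit 0 allows the tool call to proceed,
# exit 2 blocks it and feeds stderr back to the agent.

# Map the type checker's exit status to the hook's exit code (0 = allow, 2 = block).
hook_exit_for() {
  if [ "$1" -eq 0 ]; then echo 0; else echo 2; fi
}

# Real usage would look like:
#   npx tsc --noEmit 2>tsc-errors.txt
#   if [ "$?" -ne 0 ]; then cat tsc-errors.txt 1>/dev/stderr; exit 2; fi

echo "checker status 0 maps to hook exit $(hook_exit_for 0)"
echo "checker status 1 maps to hook exit $(hook_exit_for 1)"
```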

&lt;p&gt;Compressed to one line: &lt;strong&gt;domain-prefixed naming / global rules in one place / delegate determinism to hooks.&lt;/strong&gt; I learned each of these the hard way, one incident each. You just read about them, so maybe you can skip one.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Six Skills That Compound
&lt;/h2&gt;

&lt;p&gt;Each of the six pays off on its own. Running the Research skill alone speeds up research; running Fact-Check alone filters out at least one wrong premise. But the real return of this structure comes from &lt;strong&gt;the relationships between the skills&lt;/strong&gt;. A clean Fact-Check gives Plan a clean input; a well-written Plan cuts hallucinations at Implement; a tight Implement cuts load at Debug; a fast Debug lets Review concentrate on structural issues. Improve any one stage by 1.2× and that 1.2 multiplies into the next (five stages of 1.2× compound to roughly 2.5×), until the original task feels two to three times faster. I call this &lt;strong&gt;AI compounding&lt;/strong&gt;. It is a return you cannot get from single-prompt improvements.&lt;/p&gt;

&lt;p&gt;The reason every stage needs a differently shaped gate is that every stage has a different failure mode. Research fails on source credibility, Plan fails on implicit premises, Implement fails on missing context, Debug fails on the opposite — context overflow — and Review fails on author bias. They all share the name "gate," but the machines inside are entirely different. One kind of device cannot block five kinds of failure.&lt;/p&gt;

&lt;p&gt;Next quarter I want to try two directions. One is &lt;strong&gt;automatic skill chaining&lt;/strong&gt; — Research finishes, Fact-Check fires automatically, and Plan fires only on pass. Today I type &lt;code&gt;/fact-check&lt;/code&gt; by hand. The other is &lt;strong&gt;expanding hook-based gate automation&lt;/strong&gt; — today only Implement has a hook attached, and I can see room for deterministic check hooks at Fact-Check and Review. If both land, gates will no longer be something &lt;strong&gt;I&lt;/strong&gt; guard — they become something &lt;strong&gt;the system&lt;/strong&gt; guards. When that moment arrives I'll have another reason to write a follow-up.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>hallucination</category>
      <category>zephyr</category>
    </item>
    <item>
      <title>The Build Passed, So Why Doesn't It Run — Automating Firmware Tests on Real Hardware</title>
      <dc:creator>Errata Hunter</dc:creator>
      <pubDate>Wed, 15 Apr 2026 21:29:49 +0000</pubDate>
      <link>https://dev.to/erratahunter/the-build-passed-so-why-doesnt-it-run-automating-firmware-tests-on-real-hardware-2m5i</link>
      <guid>https://dev.to/erratahunter/the-build-passed-so-why-doesnt-it-run-automating-firmware-tests-on-real-hardware-2m5i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A passing &lt;code&gt;west build&lt;/code&gt; doesn't mean the firmware runs on hardware — catching silent failures manually doesn't scale.&lt;/li&gt;
&lt;li&gt;Combining Zephyr Twister's &lt;code&gt;--device-testing&lt;/code&gt; mode with a self-hosted runner gives you automated serial-log-based testing on real boards with every push.&lt;/li&gt;
&lt;li&gt;All you need to start is a Raspberry Pi (or even a Windows PC) and a J-Link.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The most deflating moment in firmware development is when &lt;code&gt;west build&lt;/code&gt; passes cleanly, you flash the board, and nothing happens. No serial output, no LED activity — just a Hard Fault dump, or worse, dead silence.&lt;/p&gt;

&lt;p&gt;A successful build means "the code compiles and links." It does not mean "the firmware behaves as intended on hardware." Verifying that gap requires plugging in a J-Link, flashing, opening a serial terminal, and reading the logs by hand. Repeat this for every change, and eventually you start cutting corners: "I'll just spot-check this one." Those shortcuts compound, and regression bugs creep in.&lt;/p&gt;

&lt;p&gt;This post documents how I automated that manual verification. On every push, the firmware is automatically flashed to a real board, serial logs are captured, and pass/fail is determined — a HIL (Hardware-in-the-Loop) CI (Continuous Integration) pipeline. I built it with Zephyr's &lt;a href="https://docs.zephyrproject.org/latest/develop/test/twister.html" rel="noopener noreferrer"&gt;Twister&lt;/a&gt; test framework, a self-hosted runner, and a single Raspberry Pi.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is the fifth post in the "AI and Embedded Firmware" series. The previous post introduced a workflow for structurally preventing AI hallucinations, and this post closes the last gap: manual testing. Each post stands on its own — you don't need to read the series in order.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Cost of "Just Flash It and See"
&lt;/h2&gt;

&lt;p&gt;The weakest link in the four-stage loop was the final stage: testing. After the AI wrote the code, I reviewed it, and &lt;code&gt;west build&lt;/code&gt; passed, the next step was entirely manual — plug in a J-Link (SEGGER's debug probe), run &lt;code&gt;west flash&lt;/code&gt;, open a serial terminal, and read the logs.&lt;/p&gt;

&lt;p&gt;Three problems with this manual routine:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, a passing build doesn't guarantee correct behavior.&lt;/strong&gt; AI-generated Zephyr code frequently passed &lt;code&gt;west build&lt;/code&gt; and then silently failed on hardware. Calling &lt;code&gt;k_sleep()&lt;/code&gt; inside a timer callback, attempting heap allocation in an ISR (Interrupt Service Routine) context — the compiler catches none of this. You only see the Hard Fault after flashing, or worse, the board just hangs with zero output. The &lt;a href="https://survey.stackoverflow.co/2025/" rel="noopener noreferrer"&gt;Stack Overflow 2025 Developer Survey&lt;/a&gt; reported that 45% of developers spend more time debugging AI-generated code than writing it themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, manual testing doesn't scale.&lt;/strong&gt; Fixing one feature can break another. In web development, automated test suites catch these regressions. My firmware workflow had no such safety net. Testing every feature manually after every change is impractical, so I'd only verify "the part I just touched." That's exactly how regression bugs get in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, repetition creates friction.&lt;/strong&gt; Plug in J-Link, wait for flash, open serial monitor, scan for log patterns, record the result. Two to three minutes each time, but across dozens of iterations per day with the AI loop, the cumulative cost adds up. The real damage, though, is the temptation to skip it. And the code you skip testing on is the code that causes problems later.&lt;/p&gt;

&lt;p&gt;The first three stages of the loop — research, planning, execution — were already efficient thanks to AI collaboration. But as long as the last stage was manual, it bottlenecked the entire pipeline. The missing piece was test automation. And in embedded, test automation means putting real hardware in the loop — HIL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wait — Can't I Just Test on the Board Locally?
&lt;/h3&gt;

&lt;p&gt;"I already have a board on my desk and I'm running &lt;code&gt;west flash&lt;/code&gt; — why bother with CI?" I thought the same thing at first.&lt;/p&gt;

&lt;p&gt;Local testing and HIL CI perform the same physical actions (flash, check serial logs), but the implications differ:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Local Testing (My Desk)&lt;/th&gt;
&lt;th&gt;HIL CI (Automated)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When it runs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When I remember to do it&lt;/td&gt;
&lt;td&gt;Automatically on every push&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tends to verify "just the part I changed"&lt;/td&gt;
&lt;td&gt;Runs the entire defined test suite every time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Environment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Depends on my PC's current state&lt;/td&gt;
&lt;td&gt;Pinned via Docker / fixed SDK version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Records&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stays in my head&lt;/td&gt;
&lt;td&gt;Persisted in CI logs, visible to the team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regression prevention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Previous features are probably fine"&lt;/td&gt;
&lt;td&gt;"Previous features still pass" — verified automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It's the same difference as running &lt;code&gt;npm test&lt;/code&gt; locally versus having GitHub Actions run it on every PR. Local testing is a snapshot of "what I verified right now." CI testing is a gate that "every change must pass before merge."&lt;/p&gt;

&lt;p&gt;This difference matters especially for firmware because regressions take far longer to surface. A web service shows error rate spikes on a monitoring dashboard immediately after deploy. Firmware can silently malfunction — intermittent BLE disconnects, failing to wake from sleep under specific conditions. You may not find out until a customer reports it. CI verifying basic behavior on every commit means that at minimum, you can &lt;code&gt;git bisect&lt;/code&gt; to find "when exactly did it break."&lt;/p&gt;
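&lt;p&gt;The &lt;code&gt;git bisect&lt;/code&gt; mechanics are worth seeing once. A toy demo in which a marker file stands in for "the HIL test fails" (in real use the run command would be the Twister line from the CI config):&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Toy demo of `git bisect run`: a marker file stands in for a failing HIL test.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email ci@example.com
git config user.name ci

for i in 1 2 3 4 5; do
  if [ "$i" -ge 4 ]; then touch BROKEN; fi   # commit 4 introduces the regression
  echo "build $i" > version.txt
  git add -A
  git commit -q -m "commit $i"
done

# HEAD is bad, the root commit is known good. The run command is the oracle:
# exit 0 means the "HIL test" passed on that commit, non-zero means it failed.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"
git bisect run sh -c 'test ! -f BROKEN' > /dev/null
git show -s --format=%s refs/bisect/bad   # prints: commit 4
```

With real hardware, the oracle becomes &lt;code&gt;west twister --device-testing --hardware-map hardware-map.yml -T tests/&lt;/code&gt;, and bisect does the flashing and checking for you.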




&lt;h2&gt;
  
  
  Embedded CI Is Not Web CI
&lt;/h2&gt;

&lt;p&gt;Web backend CI is comparatively straightforward. Push code, a cloud VM spins up, installs dependencies, runs tests, reports results. The VM starts clean every time, so environment-related flaky tests are relatively rare.&lt;/p&gt;

&lt;p&gt;Embedded CI is fundamentally different.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Limits of QEMU
&lt;/h3&gt;

&lt;p&gt;Zephyr has a built-in test runner called &lt;a href="https://docs.zephyrproject.org/latest/develop/test/twister.html" rel="noopener noreferrer"&gt;Twister&lt;/a&gt;, and Twister can run tests on &lt;a href="https://www.qemu.org/" rel="noopener noreferrer"&gt;QEMU&lt;/a&gt; (an open-source hardware emulator). Testing without a physical board, straight from a CI server — that's appealing. The Zephyr project itself runs thousands of QEMU-based Twister tests.&lt;/p&gt;

&lt;p&gt;But QEMU's coverage has hard limits:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Verifiable with QEMU&lt;/th&gt;
&lt;th&gt;Not Verifiable with QEMU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kernel scheduling, mutexes, semaphores&lt;/td&gt;
&lt;td&gt;GPIO, SPI, I2C driver behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory allocation/deallocation logic&lt;/td&gt;
&lt;td&gt;BLE stack (connection, pairing, data transfer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data structures, protocol parsing&lt;/td&gt;
&lt;td&gt;DMA (Direct Memory Access) transfers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State machine transitions&lt;/td&gt;
&lt;td&gt;Interrupt timing, priority inversion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure algorithm tests&lt;/td&gt;
&lt;td&gt;Power management (sleep, wake)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;QEMU's device models implement only a subset of peripheral behavior — whatever an emulated environment doesn't need simply isn't there. The core functionality of most product firmware sits in the right column. The reality is: "Most firmware is too tightly coupled with hardware for emulation to be the only path forward — at some point, the dev board is the only way to make progress."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renode.io/" rel="noopener noreferrer"&gt;Renode&lt;/a&gt; is an alternative emulator with richer peripheral emulation. Memfault's Interrupt blog covered a &lt;a href="https://interrupt.memfault.com/blog/test-automation-renode" rel="noopener noreferrer"&gt;test automation case combining GitHub Actions and Renode&lt;/a&gt;. But no matter how advanced emulators get, reproducing BLE RF paths or real sensor analog characteristics remains fundamentally difficult.&lt;/p&gt;

&lt;h3&gt;
  
  
  Variables That Only Physical Hardware Creates
&lt;/h3&gt;

&lt;p&gt;Real-board testing introduces variables that don't exist in emulation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timing:&lt;/strong&gt; Virtual time in an emulator and physical time on real hardware flow differently. A 100ms timeout can pass in QEMU and fail on the board.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power:&lt;/strong&gt; Unstable USB hub power can reset the board or interrupt flashing mid-process. The CI log just says "connection lost."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RF environment:&lt;/strong&gt; BLE tests are affected by ambient Wi-Fi interference. The same code can pass at the office and fail in the server room.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These variables create flaky tests. In web CI, flaky tests are mostly async timing issues fixable by code changes. In embedded CI, flaky tests are often caused by the physical environment — no amount of code changes will eliminate them.&lt;/p&gt;

&lt;p&gt;That's the reality. Embedded CI is not a world where "correct code guarantees passing tests." But it's still better than manual testing. "Imperfect but automated verification" is more reliable in practice than "thorough but human-dependent verification." I decided to build a HIL CI pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pipeline Design — How Far to Automate
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Self-hosted Runner: The Common Pattern for Connecting Physical Boards to CI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.github.com/en/actions" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt;, &lt;a href="https://docs.gitlab.com/ci/" rel="noopener noreferrer"&gt;GitLab CI/CD&lt;/a&gt;, &lt;a href="https://support.atlassian.com/bitbucket-cloud/docs/get-started-with-bitbucket-pipelines/" rel="noopener noreferrer"&gt;Bitbucket Pipelines&lt;/a&gt; — all default to cloud VM runners. You can't plug an nRF52 DK into a cloud VM via USB, so all three platforms support &lt;strong&gt;self-hosted runners&lt;/strong&gt;: installing a CI agent on your own physical machine.&lt;/p&gt;

&lt;p&gt;The architecture is the same regardless of platform:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1zplnf2rlnj8qw97lb2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1zplnf2rlnj8qw97lb2.webp" alt="HIL CI architecture — Git push triggers cloud workflow, self-hosted runner flashes firmware via USB to nRF52 DK" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Self-hosted runner bridges the cloud CI platform and the physical board&lt;/p&gt;

&lt;p&gt;I chose a Raspberry Pi 4 as the runner. The reason is simple: low power consumption for 24/7 operation, four USB ports for connecting multiple boards, and ARM Linux where the Zephyr toolchain runs natively. [TBD: Need to add actual Raspberry Pi performance/stability experience after use]&lt;/p&gt;

&lt;h3&gt;
  
  
  You Don't Need a Raspberry Pi
&lt;/h3&gt;

&lt;p&gt;"Do I have to buy a Raspberry Pi?" No. A self-hosted runner is any machine that can run the CI agent software. A Linux desktop, a macOS laptop, even a Windows PC works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using a Windows PC as a runner:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt;, &lt;a href="https://docs.gitlab.com/runner/install/windows/" rel="noopener noreferrer"&gt;GitLab CI&lt;/a&gt;, and &lt;a href="https://support.atlassian.com/bitbucket-cloud/docs/runners/" rel="noopener noreferrer"&gt;Bitbucket Pipelines&lt;/a&gt; all officially support Windows runner agents. The GitHub Actions runner is best installed in a drive root folder like &lt;code&gt;C:\actions-runner&lt;/code&gt; (to avoid Windows path length limits), and GitLab Runner provides an &lt;code&gt;.exe&lt;/code&gt; installer.&lt;/p&gt;

&lt;p&gt;Build and flash tools also run on Windows. &lt;code&gt;west build&lt;/code&gt;, &lt;code&gt;west flash&lt;/code&gt;, and &lt;a href="https://docs.nordicsemi.com/bundle/ug_nrf_cltools/page/UG/cltools/nrf_nrfjprogexe_reference.html" rel="noopener noreferrer"&gt;&lt;code&gt;nrfjprog&lt;/code&gt;&lt;/a&gt; all officially support Windows. Install &lt;a href="https://www.nordicsemi.com/Products/Development-tools/nRF-Command-Line-Tools/Download" rel="noopener noreferrer"&gt;nRF Command Line Tools&lt;/a&gt;, and &lt;code&gt;nrfjprog&lt;/code&gt; is on your PATH. With J-Link drivers installed, you can flash to a USB-connected board immediately. Git for Windows includes Git Bash, so most shell commands in CI YAML &lt;code&gt;run:&lt;/code&gt; blocks execute as-is.&lt;/p&gt;

&lt;p&gt;The trade-offs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Raspberry Pi&lt;/th&gt;
&lt;th&gt;Windows PC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;24/7 operation&lt;/td&gt;
&lt;td&gt;5W power draw, no issue&lt;/td&gt;
&lt;td&gt;Keeping a PC always on is impractical; sleep mode kills the runner&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker support&lt;/td&gt;
&lt;td&gt;Native Linux, works out of the box&lt;/td&gt;
&lt;td&gt;Requires Docker Desktop or WSL2. nrf-docker is an amd64 Linux image, so WSL2 backend is mandatory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB stability&lt;/td&gt;
&lt;td&gt;Dedicated device, minimal interference&lt;/td&gt;
&lt;td&gt;Potential port contention with other USB devices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Upfront cost&lt;/td&gt;
&lt;td&gt;~$100 (Pi + board)&lt;/td&gt;
&lt;td&gt;$0 if using an existing PC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I prefer a dedicated runner machine, which is why I chose the Pi. But if you're just getting started, installing the runner on an existing Windows PC and plugging the board in via USB is the lowest-friction entry point. You can split it off to a Pi later once CI is stable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Platform Comparison
&lt;/h3&gt;

&lt;p&gt;Runner registration differs across the three platforms, but the end result — "run CI jobs on a local machine with access to connected hardware" — is identical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Actions:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/hil-test.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HIL Test&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;flash-and-test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;self-hosted&lt;/span&gt;  &lt;span class="c1"&gt;# runs on self-hosted runner&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build firmware&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;west build -b nrf52dk/nrf52832&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Flash and test&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;west twister --device-testing --hardware-map hardware-map.yml -T tests/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitLab CI/CD:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .gitlab-ci.yml&lt;/span&gt;
&lt;span class="na"&gt;hil-test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;nrf52dk&lt;/span&gt;  &lt;span class="c1"&gt;# only runs on runners tagged with this label&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;west build -b nrf52dk/nrf52832&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;west twister --device-testing --hardware-map hardware-map.yml -T tests/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bitbucket Pipelines:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# bitbucket-pipelines.yml&lt;/span&gt;
&lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HIL Test&lt;/span&gt;
        &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;self.hosted&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;linux&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;nrf52dk&lt;/span&gt;  &lt;span class="c1"&gt;# custom label&lt;/span&gt;
        &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;west build -b nrf52dk/nrf52832&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;west twister --device-testing --hardware-map hardware-map.yml -T tests/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key difference is runner selection syntax. GitHub uses &lt;code&gt;runs-on: self-hosted&lt;/code&gt;, GitLab uses &lt;code&gt;tags:&lt;/code&gt;, Bitbucket uses &lt;code&gt;runs-on:&lt;/code&gt; with a label array. The build and test commands are identical.&lt;/p&gt;
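&lt;p&gt;The &lt;code&gt;hardware-map.yml&lt;/code&gt; all three commands reference tells Twister which board hangs off which port. Twister can generate a skeleton with &lt;code&gt;west twister --generate-hardware-map map.yml&lt;/code&gt;; a typical entry looks like this (the probe serial number and port are placeholders):&lt;/p&gt;

```yaml
# hardware-map.yml (illustrative values)
- connected: true
  id: "000683000000"        # J-Link probe serial number
  platform: nrf52dk/nrf52832
  product: J-Link
  runner: nrfjprog
  serial: /dev/ttyACM0      # board's UART console port
```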

&lt;p&gt;I found GitLab's tag system most natural for embedded. Tag runners with &lt;code&gt;nrf52dk&lt;/code&gt;, &lt;code&gt;esp32&lt;/code&gt;, &lt;code&gt;stm32f4&lt;/code&gt;, and tests automatically route to the matching hardware. I'd heard that one reason the embedded/semiconductor industry favors GitLab Self-managed instances is this flexible runner tag system — after trying it myself, I can see why.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Happens on a Single Push — Step by Step
&lt;/h3&gt;

&lt;p&gt;The YAML reads as "build and test," but behind the scenes, three actors — the CI platform (cloud), the self-hosted runner (local machine), and the dev board (USB-connected) — interact through multiple sequential stages. Here's what happens at each step and where logs are generated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19j2ako46uvi0kz5lbru.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19j2ako46uvi0kz5lbru.webp" alt="HIL CI sequence diagram — step-by-step interaction between CI platform, self-hosted runner, and development board during a single push event" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The complete sequence from a single git push through build, flash, test, and verdict&lt;/p&gt;

&lt;p&gt;Breaking it down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps 1-3: Cloud.&lt;/strong&gt; The developer pushes code. The CI platform reads the YAML, finds a matching runner, and dispatches the job. At this point, the code only exists in the cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps 4-5: Runner build.&lt;/strong&gt; The runner checks out the source and cross-compiles with &lt;code&gt;west build&lt;/code&gt;. Build logs are generated here. If the build fails, it stops and the error log is uploaded to the cloud. In the split Docker architecture, this step runs on a cloud runner (amd64).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps 6-8: Physical interaction with the board.&lt;/strong&gt; On a successful build, the runner uses &lt;code&gt;nrfjprog&lt;/code&gt; to flash the firmware via USB/J-Link. The board resets, boots the new firmware, and outputs logs through the UART serial port. &lt;strong&gt;This log capture is the core of HIL&lt;/strong&gt; — the runner opens the board's serial port (&lt;code&gt;/dev/ttyACM0&lt;/code&gt; or &lt;code&gt;COM3&lt;/code&gt; on Windows) and reads the output in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 9: Verdict.&lt;/strong&gt; Twister matches the captured serial log against regex patterns defined in &lt;code&gt;testcase.yaml&lt;/code&gt;. If "Feature initialized successfully" appears within the timeout, it's a pass. Otherwise, fail.&lt;/p&gt;
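&lt;p&gt;A minimal &lt;code&gt;testcase.yaml&lt;/code&gt; for that verdict step, using Twister's console harness (the test identifier and match strings are illustrative):&lt;/p&gt;

```yaml
# tests/feature/testcase.yaml (illustrative)
tests:
  app.feature.boot:
    harness: console
    harness_config:
      type: multi_line
      ordered: true         # patterns must appear in this order
      regex:
        - "Booting Zephyr"
        - "Feature initialized successfully"
    timeout: 60             # seconds before the verdict is "fail"
```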

&lt;p&gt;&lt;strong&gt;Steps 10-11: Reporting.&lt;/strong&gt; The runner uploads the verdict and log files to the cloud. The CI platform marks the PR with a check (pass or fail). On failure, serial logs are attached as artifacts for the developer to download and analyze.&lt;/p&gt;

&lt;p&gt;Where logs are generated:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Log Type&lt;/th&gt;
&lt;th&gt;Generated At&lt;/th&gt;
&lt;th&gt;Contents&lt;/th&gt;
&lt;th&gt;What to Check on Failure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Build log&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runner (steps 4-5)&lt;/td&gt;
&lt;td&gt;Compile warnings/errors, linker errors&lt;/td&gt;
&lt;td&gt;Missing headers, Kconfig symbol errors, memory overflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flash log&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runner → Board (step 6)&lt;/td&gt;
&lt;td&gt;nrfjprog output, J-Link connection status&lt;/td&gt;
&lt;td&gt;USB recognition failure, J-Link firmware mismatch, board power issue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Serial log&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Board → Runner (step 8)&lt;/td&gt;
&lt;td&gt;Firmware boot messages, test output, Hard Fault dumps&lt;/td&gt;
&lt;td&gt;Init failure, ISR context violation, stack overflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Twister verdict log&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runner (step 9)&lt;/td&gt;
&lt;td&gt;pass/fail results, timeout info&lt;/td&gt;
&lt;td&gt;Pattern mismatch, timeout exceeded&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Reproducing the Build Environment with Docker
&lt;/h3&gt;

&lt;p&gt;The most common CI failure is "it works on my PC but not in CI." The standard solution for Zephyr/NCS projects is Docker.&lt;/p&gt;

&lt;p&gt;Nordic provides an official Docker image called &lt;a href="https://github.com/NordicPlayground/nrf-docker" rel="noopener noreferrer"&gt;&lt;code&gt;nrf-docker&lt;/code&gt;&lt;/a&gt; on &lt;a href="https://hub.docker.com/r/nordicplayground/nrfconnect-sdk" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt; (&lt;code&gt;nordicplayground/nrfconnect-sdk&lt;/code&gt;). It contains every dependency needed to run west commands — the Zephyr SDK, a Python environment, and the west tool itself. You pull this image and use it as the build environment; you're not uploading your code to Docker Hub. It's the same idea as &lt;code&gt;apt install&lt;/code&gt; for the compiler.&lt;/p&gt;

&lt;p&gt;One caveat: this official image is &lt;strong&gt;amd64 (x86_64) only&lt;/strong&gt;. A Raspberry Pi is ARM64 and can't run this image directly. So the CI pipeline splits into two stages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47s8mdt9iq2mllbdb6ny.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47s8mdt9iq2mllbdb6ny.webp" alt="Split CI pipeline — Docker build on amd64 cloud runner, flash and test on ARM64 Raspberry Pi self-hosted runner" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CI pipeline splitting amd64 Docker build from ARM64 Raspberry Pi testing&lt;/p&gt;

&lt;p&gt;How project files flow through each stage:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Where&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;How&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;git checkout&lt;/td&gt;
&lt;td&gt;Cloud/local&lt;/td&gt;
&lt;td&gt;Full source code&lt;/td&gt;
&lt;td&gt;CI auto-clones from Git repo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker pull&lt;/td&gt;
&lt;td&gt;Cloud/local&lt;/td&gt;
&lt;td&gt;Build tools (SDK, compiler)&lt;/td&gt;
&lt;td&gt;Downloads Nordic official image from Docker Hub&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;west build&lt;/td&gt;
&lt;td&gt;Inside Docker container&lt;/td&gt;
&lt;td&gt;Source → zephyr.hex&lt;/td&gt;
&lt;td&gt;ARM cross-compilation (ARM binary built on amd64 host)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Artifact transfer&lt;/td&gt;
&lt;td&gt;CI platform&lt;/td&gt;
&lt;td&gt;zephyr.hex (~hundreds of KB)&lt;/td&gt;
&lt;td&gt;GitHub Actions artifact, GitLab job artifact, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;west flash&lt;/td&gt;
&lt;td&gt;Raspberry Pi&lt;/td&gt;
&lt;td&gt;zephyr.hex → board&lt;/td&gt;
&lt;td&gt;nrfjprog flashes via USB/J-Link&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Twister test&lt;/td&gt;
&lt;td&gt;Raspberry Pi&lt;/td&gt;
&lt;td&gt;Serial logs&lt;/td&gt;
&lt;td&gt;Captures board UART output, pattern matches&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A single-stage architecture where the Raspberry Pi handles both build and flash without Docker is also viable. You'd install the Zephyr SDK and west directly on the Pi. Builds run 3-5x slower than on an amd64 machine, but the pipeline is simpler. I started with this single-stage setup since my project is small, and I'll switch to the split architecture if build time becomes a bottleneck.&lt;/p&gt;

&lt;p&gt;The CI YAML for the split Docker architecture looks like this (GitHub Actions example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/hil-test.yml — split build/test architecture&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HIL Test (Split)&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;  &lt;span class="c1"&gt;# cloud runner (amd64)&lt;/span&gt;
    &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nordicplayground/nrfconnect-sdk:v2.9-branch&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;west init -l . &amp;amp;&amp;amp; west update&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;west build -b nrf52dk/nrf52832&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/upload-artifact@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;firmware&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build/zephyr/zephyr.hex&lt;/span&gt;

  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;self-hosted&lt;/span&gt;  &lt;span class="c1"&gt;# Raspberry Pi (ARM64)&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/download-artifact@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;firmware&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Flash firmware&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nrfjprog --program zephyr.hex --chiperase --verify --reset&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Twister tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;west twister --device-testing --hardware-map hardware-map.yml -T tests/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I pinned my SDK version using the &lt;a href="https://dev.to/erratahunter/ncs-project-management-guide-ditching-global-install-to-reclaim-control-j39"&gt;T2 topology&lt;/a&gt;'s &lt;code&gt;west.yml&lt;/code&gt;, so running &lt;code&gt;west init&lt;/code&gt; and &lt;code&gt;west update&lt;/code&gt; inside the Docker image reproduces the exact same environment as my dev PC. Accessing USB devices from inside a Docker container requires the &lt;code&gt;--device&lt;/code&gt; flag, and its behavior varies subtly across platforms — which is another reason I chose the split architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  HIL CI Works Without T2 Topology Too
&lt;/h3&gt;

&lt;p&gt;The example above assumes T2 topology (a &lt;code&gt;west.yml&lt;/code&gt; manifest at the project root). But HIL CI itself doesn't require T2. All you need is "a buildable project" and "a board to flash."&lt;/p&gt;

&lt;p&gt;The build method in CI varies by project structure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project Structure&lt;/th&gt;
&lt;th&gt;How to Build in CI&lt;/th&gt;
&lt;th&gt;SDK Version Management&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;T2 topology&lt;/strong&gt; (&lt;code&gt;west.yml&lt;/code&gt; present)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;west init -l . &amp;amp;&amp;amp; west update &amp;amp;&amp;amp; west build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;west.yml&lt;/code&gt; pins SDK revision — high reproducibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Freestanding&lt;/strong&gt; (local SDK folder, &lt;code&gt;ZEPHYR_BASE&lt;/code&gt; env var)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;export ZEPHYR_BASE=/path/to/sdk &amp;amp;&amp;amp; west build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pre-install SDK on runner, or clone a specific version in CI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;nRF Connect SDK + VS Code extension&lt;/strong&gt; (GUI-based build)&lt;/td&gt;
&lt;td&gt;Build the same project via CLI: &lt;code&gt;west build -b nrf52dk/nrf52832&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Pin SDK version via env var or Docker image tag&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The simplest way to put a freestanding project into CI is to pre-install the NCS SDK on the runner machine and set &lt;code&gt;ZEPHYR_BASE&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Freestanding project CI example (GitHub Actions)&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hil-test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;self-hosted&lt;/span&gt;  &lt;span class="c1"&gt;# runner with pre-installed SDK&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ZEPHYR_BASE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/runner/ncs/v2.9.0/zephyr&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;west build -b nrf52dk/nrf52832&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;west twister --device-testing --hardware-map hardware-map.yml -T tests/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The downside: the SDK version is tied to the runner machine. Updating the runner's SDK affects every project. That's exactly why T2 topology uses &lt;code&gt;west.yml&lt;/code&gt; to pin SDK versions independently per project. But if you have a single project and just want to get CI running, freestanding is enough. You can upgrade the structure later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Precedent: Golioth's Implementation
&lt;/h3&gt;

&lt;p&gt;The implementation I referenced most while designing this pipeline was &lt;a href="https://blog.golioth.io/golioth-hil-testing-part1/" rel="noopener noreferrer"&gt;Golioth's HIL case study&lt;/a&gt;. Golioth, an IoT platform company, runs exactly this architecture — Raspberry Pi + GitHub Actions self-hosted runner + nRF52840dk — to execute automated HIL tests on every PR.&lt;/p&gt;

&lt;p&gt;Key design decisions from Golioth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Record all connected devices in hardware-map.yml.&lt;/strong&gt; Serial port, device ID, platform, and runner info are managed in YAML. When a board is added or swapped, only this file needs updating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-stage WiFi/cloud credentials on the runner locally.&lt;/strong&gt; No secrets in the repository. Setup files live on the runner machine, and the workflow references them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-detect connected boards.&lt;/strong&gt; They wrote a script that automatically recognizes USB-connected boards and generates the hardware-map.yml. Physically swapping a board is reflected on the next CI run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I didn't adopt this structure wholesale. Golioth is a cloud service company, so they validate network connectivity, authentication, and OTA (Over-the-Air firmware update) via HIL. My immediate need was simpler: "flash after build, verify basic behavior via serial logs." Scope your automation to match your actual needs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Twister + Real Hardware — Writing and Running Tests
&lt;/h2&gt;

&lt;h3&gt;
  
  
  hardware-map.yml and testcase.yaml
&lt;/h3&gt;

&lt;p&gt;Twister's &lt;code&gt;--device-testing&lt;/code&gt; mode operates on two YAML files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;hardware-map.yml&lt;/strong&gt; — physical board info connected to the runner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# hardware-map.yml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;connected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;000683459357&lt;/span&gt;        &lt;span class="c1"&gt;# J-Link serial number&lt;/span&gt;
  &lt;span class="na"&gt;platform&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nrf52dk/nrf52832&lt;/span&gt;
  &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;J-Link&lt;/span&gt;
  &lt;span class="na"&gt;runner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nrfjprog&lt;/span&gt;         &lt;span class="c1"&gt;# flashing tool&lt;/span&gt;
  &lt;span class="na"&gt;serial&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/dev/ttyACM0&lt;/span&gt;     &lt;span class="c1"&gt;# serial port&lt;/span&gt;
  &lt;span class="na"&gt;baud&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;115200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only boards with &lt;code&gt;connected: true&lt;/code&gt; are included as test targets. The J-Link serial number (&lt;code&gt;id&lt;/code&gt;) uniquely identifies each board, so multiple boards on the same runner don't conflict. Twister's hardware map currently supports the &lt;code&gt;pyocd&lt;/code&gt;, &lt;code&gt;nrfjprog&lt;/code&gt;, &lt;code&gt;jlink&lt;/code&gt;, &lt;code&gt;openocd&lt;/code&gt;, and &lt;code&gt;dediprog&lt;/code&gt; runners; support for other runners is still in progress.&lt;/p&gt;
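&lt;p&gt;The &lt;code&gt;connected: true&lt;/code&gt; filtering is easy to emulate. A sketch of selecting test targets, with plain Python dicts standing in for the parsed YAML entries (field names taken from the file above; the second board is a made-up example):&lt;/p&gt;

```python
# Hypothetical hardware map as parsed data: only entries with
# connected set become test targets, keyed by J-Link serial number (id).
hardware_map = [
    {"connected": True,  "id": "000683459357", "platform": "nrf52dk/nrf52832",
     "runner": "nrfjprog", "serial": "/dev/ttyACM0", "baud": 115200},
    {"connected": False, "id": "000683459999", "platform": "nrf52dk/nrf52832",
     "runner": "nrfjprog", "serial": "/dev/ttyACM1", "baud": 115200},
]

targets = {b["id"]: b for b in hardware_map if b["connected"]}
print(sorted(targets))  # only the connected board's id remains
```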

&lt;p&gt;&lt;strong&gt;testcase.yaml&lt;/strong&gt; — test definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tests/my_feature/testcase.yaml&lt;/span&gt;
&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;my_app.feature.basic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;platform_allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;nrf52dk/nrf52832&lt;/span&gt;
    &lt;span class="na"&gt;harness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;console&lt;/span&gt;
    &lt;span class="na"&gt;harness_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;one_line&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;initialized&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;successfully"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Self-test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;passed"&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;feature&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;hil&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;harness: console&lt;/code&gt; finds regex patterns in serial output to determine pass/fail. If "Feature initialized successfully" appears in the log, it passes. If the pattern doesn't appear within the timeout, it fails. Simple — but it catches more than you'd expect.&lt;/p&gt;

&lt;p&gt;Execution command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;west twister &lt;span class="nt"&gt;--device-testing&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--hardware-map&lt;/span&gt; hardware-map.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-T&lt;/span&gt; tests/ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-vv&lt;/span&gt;  &lt;span class="c"&gt;# verbose output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Twister automatically builds the firmware, flashes it to the board listed in hardware-map.yml, captures serial output, matches it against testcase.yaml conditions, and reports results. &lt;code&gt;west flash&lt;/code&gt; internally calls &lt;code&gt;nrfjprog&lt;/code&gt;, which uses the J-Link DLL. In headless environments, the process runs without firmware update dialogs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Serial Logs Can and Can't Catch
&lt;/h3&gt;

&lt;p&gt;"So it just checks whether my predefined log messages appear?" Yes. And that catches more than you'd think.&lt;/p&gt;

&lt;p&gt;When debugging via serial manually, there are two modes: watching logs scroll in real time to confirm that the right message appears at the right moment, or dumping logs to a file and searching for keywords later. Serial log verification in CI is closer to the latter — capture the entire log, then automatically check whether predefined patterns are present or absent.&lt;/p&gt;
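&lt;p&gt;That "capture everything, then search" model fits in a few lines. A hedged sketch with both must-appear and must-not-appear patterns (the pattern strings are examples, and &lt;code&gt;check_log&lt;/code&gt; is my own helper, not a Twister API):&lt;/p&gt;

```python
import re

# Must appear somewhere in the captured log.
REQUIRED = [r"Feature initialized successfully"]
# Must never appear (Zephyr fault banners).
FORBIDDEN = [r"HARD FAULT", r"ZEPHYR FATAL ERROR"]

def check_log(captured):
    """Return a list of failure reasons; an empty list means pass."""
    failures = []
    for pat in REQUIRED:
        if not re.search(pat, captured):
            failures.append("missing: " + pat)
    for pat in FORBIDDEN:
        if re.search(pat, captured):
            failures.append("forbidden: " + pat)
    return failures

print(check_log("app: Feature initialized successfully"))  # []
print(check_log("os: ***** HARD FAULT *****"))             # two failures
```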

&lt;p&gt;Specific firmware scenarios this simple mechanism catches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Boot initialization sequence verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Firmware typically initializes subsystems in order at boot. BLE stack, then sensor driver, then application logic. Miss a Kconfig option, and a subsystem silently drops out. Manually, you might notice "the log looks shorter than usual" and move on. CI flags it immediately when the "BLE stack initialized" pattern is missing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialization sequence testcase&lt;/span&gt;
&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;boot.init_sequence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;harness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;console&lt;/span&gt;
    &lt;span class="na"&gt;harness_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;multi_line&lt;/span&gt;
      &lt;span class="na"&gt;ordered&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;[00:00:00.0&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;d+&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;inf&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;app:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;System&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;starting"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;[00:00:00.&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;d+&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;inf&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ble:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;BLE&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;stack&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;initialized"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;[00:00:00.&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;d+&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;inf&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sensor:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;IMU&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ready"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;[00:00:01.&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;d+&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;inf&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;app:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;All&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;subsystems&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;up"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;type: multi_line&lt;/code&gt; with &lt;code&gt;ordered: true&lt;/code&gt; means the patterns must appear &lt;strong&gt;in this exact order&lt;/strong&gt;. Out of order or missing one — fail. I caught an issue this way when the AI refactored code and inadvertently changed the initialization order.&lt;/p&gt;
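&lt;p&gt;The ordered check can be sketched by searching each pattern only after the previous match position (my illustration of the idea, not Twister's code):&lt;/p&gt;

```python
import re

def matches_in_order(log, patterns):
    """True only if every pattern matches, each after the previous match."""
    pos = 0
    for pat in patterns:
        m = re.compile(pat).search(log, pos)
        if m is None:
            return False
        pos = m.end()  # next pattern must match after this point
    return True

patterns = [r"System starting", r"BLE stack initialized", r"IMU ready"]
ok  = "app: System starting\nble: BLE stack initialized\nsensor: IMU ready\n"
bad = "ble: BLE stack initialized\napp: System starting\nsensor: IMU ready\n"
print(matches_in_order(ok, patterns))   # True
print(matches_in_order(bad, patterns))  # False — same lines, wrong order
```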

&lt;p&gt;&lt;strong&gt;2. Automatic Hard Fault detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Calling &lt;code&gt;k_sleep()&lt;/code&gt; in ISR context or dereferencing a null pointer triggers a Hard Fault on ARM Cortex-M. Zephyr's default Fault Handler dumps registers to serial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[00:00:01.234] &amp;lt;err&amp;gt; os: ***** HARD FAULT *****
[00:00:01.234] &amp;lt;err&amp;gt; os:   Fault escalation (see below)
[00:00:01.235] &amp;lt;err&amp;gt; os: r0/a1:  0x00000000  r1/a2:  0x20001234
[00:00:01.235] &amp;lt;err&amp;gt; os: Current thread: 0x20000458 (main)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a pattern that &lt;strong&gt;must not appear&lt;/strong&gt;. You can set it as a failure condition in testcase.yaml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fail unconditionally on Hard Fault&lt;/span&gt;
&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;safety.no_hard_fault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;harness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;console&lt;/span&gt;
    &lt;span class="na"&gt;harness_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;one_line&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;self-tests&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;passed"&lt;/span&gt;
      &lt;span class="na"&gt;fail_on_fault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# default is true, but stated explicitly for clarity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If I were watching the serial monitor myself, I'd spot the Hard Fault dump immediately. But without CI, tracing "which of the 5 commits pushed over the weekend broke it" is painful. CI running this test on every commit tells you exactly which commit introduced the fault — no &lt;code&gt;git bisect&lt;/code&gt; needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Timing-based verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zephyr logs include timestamps. This lets you verify timing requirements like "BLE advertising must start within 3 seconds of boot":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Verify advertising starts within 3 seconds of boot&lt;/span&gt;
&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ble.adv_start_timing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;harness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;console&lt;/span&gt;
    &lt;span class="na"&gt;harness_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;one_line&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;[00:00:0[0-2]&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;d+&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;inf&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ble:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Advertising&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;started"&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The regex &lt;code&gt;[00:00:0[0-2]\\.\\d+]&lt;/code&gt; only matches timestamps between 0 and 2 seconds. If advertising starts after 3 seconds, the pattern doesn't match, and the test times out as a failure.&lt;/p&gt;
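&lt;p&gt;It's worth sanity-checking a timestamp regex like this locally before wiring it into testcase.yaml:&lt;/p&gt;

```python
import re

# Matches Zephyr timestamps from 00:00:00.x through 00:00:02.x, i.e. under 3 s.
PATTERN = re.compile(r"\[00:00:0[0-2]\.\d+\]")

print(bool(PATTERN.search("[00:00:01.842] ble: Advertising started")))  # True
print(bool(PATTERN.search("[00:00:03.120] ble: Advertising started")))  # False
```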

&lt;p&gt;&lt;strong&gt;4. Memory usage regression detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enabling Zephyr's &lt;a href="https://docs.zephyrproject.org/latest/services/debugging/thread-analyzer.html" rel="noopener noreferrer"&gt;&lt;code&gt;CONFIG_THREAD_ANALYZER&lt;/code&gt;&lt;/a&gt; periodically logs each thread's stack usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[00:00:05.000] &amp;lt;inf&amp;gt; thread_analyzer:  main    : STACK: unused 512 usage 1536 / 2048 (75 %); CPU: 12 %
[00:00:05.000] &amp;lt;inf&amp;gt; thread_analyzer:  ble_rx  : STACK: unused 128 usage 896 / 1024 (87 %); CPU: 3 %
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"unused 128" means only 128 bytes of stack headroom remain. You can pattern-match this and fail when headroom drops below a threshold — catching stack growth early as the AI adds code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this approach can't catch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Serial log pattern matching only verifies "logs I predicted in advance." Unexpected failures — BLE disconnecting after 30 minutes, sensor values drifting at certain temperatures — won't be caught unless you build tests that reproduce those specific conditions.&lt;/p&gt;

&lt;p&gt;Real-time interactive debugging is also outside CI's scope. "Watch serial output while pressing a button at a specific moment" is still a desk job. CI's role is "automatically re-verify known correct behavior on every commit," not "discover new problems." When you do discover a new problem, you write a test for it and add it to CI — that's how test suites naturally grow thicker over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatable Tests vs. Non-automatable Tests
&lt;/h3&gt;

&lt;p&gt;Not everything can be automated with HIL. Drawing the boundary clearly matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatable:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UART/RTT log output verification (string pattern matching)&lt;/li&gt;
&lt;li&gt;State machine transition checks (log state changes, verify sequence)&lt;/li&gt;
&lt;li&gt;Boot time measurement (timestamp-based)&lt;/li&gt;
&lt;li&gt;I2C/SPI device response checks (when sensors are physically connected)&lt;/li&gt;
&lt;li&gt;Memory usage reports (parsing the .map file generated at build time)&lt;/li&gt;
&lt;/ul&gt;
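&lt;p&gt;The last bullet is the easiest starting point. The linker prints a memory summary at the end of every build (the same numbers the .map file contains); a hedged sketch of parsing it, assuming the GNU ld summary format shown in the sample text below:&lt;/p&gt;

```python
import re

# Sample of the memory summary GNU ld prints at the end of a Zephyr build;
# the numbers here are made up for illustration.
SUMMARY = """\
Memory region         Used Size  Region Size  %age Used
           FLASH:      123456 B       512 KB     23.54%
            SRAM:       45678 B        64 KB     69.70%
"""

REGION = re.compile(r"^\s*(\w+):\s+(\d+) B\s+(\d+) KB\s+([\d.]+)%", re.M)

def memory_usage(text):
    """Map region name to (used_bytes, region_kb, percent_used)."""
    usage = {}
    for name, used, size, pct in REGION.findall(text):
        usage[name] = (int(used), int(size), float(pct))
    return usage

print(memory_usage(SUMMARY))
```

&lt;p&gt;Comparing these numbers against the previous build on every PR turns "the binary quietly grew 20%" into a visible CI signal.&lt;/p&gt;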

&lt;p&gt;&lt;strong&gt;Difficult or impossible to automate:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BLE RF performance (RSSI, packet error rate) — requires dedicated test equipment&lt;/li&gt;
&lt;li&gt;Analog sensor accuracy — requires a reference input source&lt;/li&gt;
&lt;li&gt;Power consumption measurement — requires a current probe (Zephyr 4.2 added a power measurement harness to Twister, but it needs physical measurement hardware)&lt;/li&gt;
&lt;li&gt;Long-duration stress tests — hits CI execution time limits&lt;/li&gt;
&lt;li&gt;UI/display output — camera-based verification is possible but complex (&lt;a href="https://www.zephyrproject.org/zephyr-4-3-is-here-whats-new/" rel="noopener noreferrer"&gt;Zephyr 4.3 added visual fingerprint matching&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I focused on the "automatable" list. The most common failure patterns in AI-generated code — boot initialization failures, features silently disabled by wrong Kconfig, Hard Faults from ISR context violations — are all catchable via serial logs. Aiming for perfection means never starting. "Automatically catching 80% of the most common failures" is the realistic goal.&lt;/p&gt;
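
&lt;p&gt;The serial-log checks in the "automatable" list are exactly what Twister's console harness encodes. A sketch of a &lt;code&gt;testcase.yaml&lt;/code&gt; (test name, platform, and log strings are illustrative):&lt;br&gt;
&lt;/p&gt;

```yaml
# The test passes only when the device's serial output matches
# these regexes, in order. Strings here are illustrative.
tests:
  app.boot.smoke:
    platform_allow:
      - nrf52dk/nrf52832
    harness: console
    harness_config:
      type: multi_line
      ordered: true
      regex:
        - "Booting Zephyr OS"
        - "i2c: sensor init OK"
        - "bt: advertising started"
```

&lt;p&gt;A boot hang or a silently disabled subsystem then fails CI as a missing log line instead of being discovered by hand.&lt;/p&gt;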




&lt;h2&gt;
  
  
  Plugging CI into the AI Workflow — Closing the Loop
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Research, Plan, Execute, Test, CI: The Final Workflow
&lt;/h3&gt;

&lt;p&gt;Adding CI to the four-stage loop from &lt;a href="https://reversetobuild.com/ai-firmware-development-workflow/" rel="noopener noreferrer"&gt;series #4&lt;/a&gt; produces this workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y06adht4vsp37rf5otf.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y06adht4vsp37rf5otf.webp" alt="Complete AI firmware development workflow — Research, Plan, Execute, PR, CI Pipeline with HIL testing, and AI feedback loop on failure" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;HIL CI joins the four-stage AI loop, forming a closed feedback loop&lt;/p&gt;

&lt;p&gt;Creating a PR (Pull Request) triggers CI automatically. A build failure surfaces the build log; a test failure surfaces the serial log. Standard CI so far.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feeding CI Failure Logs Back to the AI
&lt;/h3&gt;

&lt;p&gt;The differentiator is the feedback loop on failure. When CI fails, I pass the serial logs to the AI for root cause analysis and fix suggestions.&lt;/p&gt;

&lt;p&gt;A finding from &lt;a href="https://reversetobuild.com/ai-firmware-development-workflow/" rel="noopener noreferrer"&gt;series #4&lt;/a&gt;: "AI's accuracy is highest when analyzing logs." Logs are factual data, which leaves little room for hallucination. The same applies to CI-captured serial logs. Hand the AI a Hard Fault register dump, stack trace, and error codes, and it provides reasonably accurate analysis: "this address corresponds to this function at this offset, and the probable cause is X."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Workflow example: save logs on CI failure (GitHub Actions)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Save failure logs&lt;/span&gt;
  &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;failure()&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;cp twister-out/*/handler.log artifacts/&lt;/span&gt;
    &lt;span class="s"&gt;cp twister-out/*/device.log artifacts/&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload artifacts&lt;/span&gt;
  &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;failure()&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/upload-artifact@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;failure-logs&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;artifacts/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feeding the saved logs to Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Request AI analysis of failure logs locally&lt;/span&gt;
claude &lt;span class="s2"&gt;"This Twister test failed in CI. Analyze device.log."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  @artifacts/device.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This loop isn't fully automated yet. There's manual intervention between CI failure, log download, and handing it to the AI. Tools like &lt;a href="https://about.gitlab.com/blog/developing-gitlab-duo-blending-ai-and-root-cause-analysis-to-fix-ci-cd/" rel="noopener noreferrer"&gt;GitLab Duo Root Cause Analysis&lt;/a&gt; are narrowing this gap, but no production tool yet auto-analyzes embedded firmware serial logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reusing Skills and Hooks in CI
&lt;/h3&gt;

&lt;p&gt;The Kconfig validation hook from &lt;a href="https://reversetobuild.com/claude-code-embedded-firmware-development/" rel="noopener noreferrer"&gt;series #3&lt;/a&gt; — a script that greps &lt;code&gt;build/zephyr/.config&lt;/code&gt; and Kconfig sources to catch nonexistent symbols when a &lt;code&gt;.conf&lt;/code&gt; file is modified — also works in CI.&lt;/p&gt;
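
&lt;p&gt;The core of such a script is small. A hedged sketch of the idea (not the actual series #3 script): every symbol set in the &lt;code&gt;.conf&lt;/code&gt; must appear in the generated &lt;code&gt;.config&lt;/code&gt;, either assigned or as "is not set"; a symbol present in neither was silently dropped by Kconfig, which almost always means it doesn't exist.&lt;br&gt;
&lt;/p&gt;

```shell
# Hedged sketch of a validate_kconfig.sh (illustrative, not the
# actual series script): flag CONFIG_* symbols set in the .conf
# that never surface in the generated .config.
check_kconfig() {
  conf="$1"
  dotconfig="$2"
  fail=0
  for sym in $(grep -o '^CONFIG_[A-Za-z0-9_]*' "$conf"); do
    if ! grep -q -e "^${sym}=" -e "^# ${sym} is not set" "$dotconfig"; then
      echo "error: ${sym} not in ${dotconfig} (nonexistent symbol?)"
      fail=1
    fi
  done
  return "$fail"
}

# Demo on synthetic files: one valid symbol, one hallucinated one
printf 'CONFIG_I2C=y\nCONFIG_SPI_NRFX_SPIM3=y\n' > /tmp/demo_prj.conf
printf 'CONFIG_I2C=y\n' > /tmp/demo_dotconfig
check_kconfig /tmp/demo_prj.conf /tmp/demo_dotconfig || echo "validation failed"
```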

&lt;p&gt;The approach is straightforward. Include the hook script in the repo and run it right after the build step in the CI workflow, once &lt;code&gt;build/zephyr/.config&lt;/code&gt; exists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run Kconfig validation hook in CI&lt;/span&gt;
- name: Validate Kconfig
  run: |
    west build &lt;span class="nt"&gt;-b&lt;/span&gt; nrf52dk/nrf52832
    ./scripts/validate_kconfig.sh prj.conf build/zephyr/.config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Claude Code hook fires when the AI modifies a &lt;code&gt;.conf&lt;/code&gt; file; the CI check catches the same class of mistake when a human edits &lt;code&gt;.conf&lt;/code&gt; by hand. The same validation logic, running at two points. Tools created during AI collaboration naturally extend into CI infrastructure — that's the compounding effect of the pipeline built across this series.&lt;/p&gt;




&lt;h2&gt;
  
  
  Remaining Gaps and Next Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What HIL CI Still Can't Catch
&lt;/h3&gt;

&lt;p&gt;I need to be honest. Adding HIL CI doesn't mean every hardware problem is automatically caught:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RF performance:&lt;/strong&gt; BLE connection stability, RSSI, and packet error rate require measurement equipment (sniffer, spectrum analyzer). Serial logs only tell you "connection succeeded/failed," not "why it failed."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term stability:&lt;/strong&gt; Memory leaks and stack overflows only surface after hours or days of operation. CI workflows typically run for minutes to tens of minutes — too short to catch these.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power consumption:&lt;/strong&gt; Current profiles of sleep/wake cycles can't be measured without a current probe. &lt;a href="https://docs.zephyrproject.org/latest/releases/release-notes-4.2.html" rel="noopener noreferrer"&gt;Zephyr 4.2 added a power measurement harness to Twister&lt;/a&gt;, but it requires physical measurement hardware on the runner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-device interaction:&lt;/strong&gt; BLE Central-Peripheral communication and mesh network behavior require controlling multiple boards simultaneously. Possible, but setup complexity escalates sharply.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just as I noted in series #4 that "security-related code (encryption, Secure Boot, OTA signing) stays manually written," HIL CI also requires consciously defining the boundary between "what to automate" and "what a human verifies."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Real Costs
&lt;/h3&gt;

&lt;p&gt;Maintaining a HIL CI pipeline has costs. I won't sugarcoat them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimum hardware:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raspberry Pi 4 (~$55) + SD card + power adapter&lt;/li&gt;
&lt;li&gt;nRF52 DK ($40) + USB cable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total: ~$100&lt;/strong&gt; (one-time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hidden operational costs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OS updates, security patches — neglect these and you have a security hole&lt;/li&gt;
&lt;li&gt;SD card lifespan — heavy writes mean replacement every 1-2 years&lt;/li&gt;
&lt;li&gt;USB connection instability — the board occasionally drops off and requires a physical reconnect&lt;/li&gt;
&lt;li&gt;GitHub was expected to introduce a $0.002/min platform fee for self-hosted runners on private repos starting March 2026, but community pushback led to an indefinite postponement. Worth watching for future changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For individuals or small teams running fewer than 1,000 builds per month, the cloud hosting cost savings are negligible. But if you've ever lost half a day to "the build passed but the board doesn't work," the $100 upfront investment pays for itself. Measure the value not in dollars, but in time and trust.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reflecting on Five Posts
&lt;/h3&gt;

&lt;p&gt;This post wraps up the technical content of the series. Here's the pipeline built across all five posts at a glance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Post&lt;/th&gt;
&lt;th&gt;Pipeline Layer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="https://reversetobuild.com/bare-metal-zephyr-antigravity-setup/" rel="noopener noreferrer"&gt;Antigravity IDE&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Development environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="https://reversetobuild.com/ncs-freestanding-t2-t3-guide/" rel="noopener noreferrer"&gt;NCS T2 Topology&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Project structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://reversetobuild.com/claude-code-embedded-firmware-development/" rel="noopener noreferrer"&gt;Claude Code Skills + Hooks&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;AI tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;a href="https://reversetobuild.com/ai-firmware-development-workflow/" rel="noopener noreferrer"&gt;Research → Plan → Execute → Test Loop&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;AI workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;HIL CI (this post)&lt;/td&gt;
&lt;td&gt;Automated verification&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Environment, structure, tooling, methodology, verification. Each layer stands on the one below it. The IDE provides the environment; the T2 topology isolates projects; Claude Code skills and hooks catch AI hallucinations on that foundation; the four-stage loop structures the workflow. And HIL CI verifies it all on real hardware.&lt;/p&gt;

&lt;p&gt;I know this setup isn't perfect. But going from "I tried having AI write firmware and it didn't work" to "a repeatable process for building firmware with AI" — that's real progress.&lt;/p&gt;

&lt;p&gt;The next post will look back at the entire five-post journey and distill what I learned at the intersection of AI and embedded firmware development — what worked, and what remains firmly in the human domain.&lt;/p&gt;

</description>
      <category>zephyr</category>
      <category>firmware</category>
      <category>hilci</category>
      <category>twister</category>
    </item>
    <item>
      <title>How I Build Firmware with AI — A Research, Plan, Execute, Test Loop in Practice</title>
      <dc:creator>Errata Hunter</dc:creator>
      <pubDate>Sat, 11 Apr 2026 21:44:44 +0000</pubDate>
      <link>https://dev.to/erratahunter/how-i-build-firmware-with-ai-a-research-plan-execute-test-loop-in-practice-178b</link>
      <guid>https://dev.to/erratahunter/how-i-build-firmware-with-ai-a-research-plan-execute-test-loop-in-practice-178b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tell an AI "implement this" in firmware and you get nonexistent register addresses and ISR-incompatible APIs that pass the build but brick the board.&lt;/li&gt;
&lt;li&gt;A 4-stage loop — research, plan, execute, test — with two human gates (datasheet cross-check, design review) stops bad information from propagating into code.&lt;/li&gt;
&lt;li&gt;AI output needs human verification during research and planning, but for log analysis AI is faster than any human — calibrate AI involvement per stage.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;I expected AI coding tools to boost my productivity. Half right, half wrong. I develop &lt;a href="https://www.zephyrproject.org/" rel="noopener noreferrer"&gt;Zephyr RTOS&lt;/a&gt;-based firmware for nRF52/nRF53 using Claude Code as my primary tool, and the first few weeks actually made things worse. &lt;a href="https://reversetobuild.com/claude-code-embedded-firmware-development/" rel="noopener noreferrer"&gt;As I covered in a previous post&lt;/a&gt;, the AI confidently recommended Kconfig symbols that don't exist, generated register settings off by a single bit from the datasheet, and wrote code calling APIs that must never run in interrupt context.&lt;/p&gt;

&lt;p&gt;The problem wasn't the AI's capability — it was how I used it. Copy-pasting AI output without verification might work for web frontends, but in firmware it's the fastest way to brick a board. After a month of trial and error, I settled on a &lt;strong&gt;research → plan → execute → test&lt;/strong&gt; loop. I don't start firmware work without it now.&lt;/p&gt;

&lt;p&gt;This is a field report on what I delegate to AI at each stage, where I intervene personally, and which pitfalls are specific to the firmware domain.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why "Move Fast and Fix Things" Doesn't Work in Firmware
&lt;/h2&gt;

&lt;p&gt;In web development, you edit code and hot reload gives you instant feedback. Something breaks, the browser console tells you, you fix it. The feedback loop runs in seconds.&lt;/p&gt;

&lt;p&gt;Firmware is different. A wrong clock configuration can render the MCU unresponsive. Misconfigure a single GPIO pin and overcurrent can physically damage external circuitry. Miss a watchdog timer setup and the device enters an infinite reset loop — tracking down the cause means connecting a &lt;a href="https://www.segger.com/products/debug-probes/j-link/" rel="noopener noreferrer"&gt;J-Link&lt;/a&gt; debugger and stepping through the boot sequence line by line. The cost of "just try it and fix later" is in a different league from the web.&lt;/p&gt;

&lt;p&gt;Three reasons AI is particularly dangerous in this domain:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, hallucinations pass compilation.&lt;/strong&gt; When an LLM generates a nonexistent register address or incorrect bit mask, the C compiler treats it as a constant. The build succeeds. The problem only surfaces when you flash the board. In web development, calling a nonexistent API triggers an immediate runtime error. In firmware, "silent failures" are far more common.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, register maps differ between variants in the same chip family.&lt;/strong&gt; &lt;a href="https://www.nordicsemi.com/Products/nRF52832" rel="noopener noreferrer"&gt;nRF52832&lt;/a&gt; and &lt;a href="https://www.nordicsemi.com/Products/nRF52840" rel="noopener noreferrer"&gt;nRF52840&lt;/a&gt; are both nRF52 series, but their peripheral configurations differ. When AI sees nRF52832 code in its training data and applies it directly to an nRF52840 target, the build passes but the hardware doesn't work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, code generation without domain context can produce fundamentally wrong patterns.&lt;/strong&gt; I've seen AI write a UART receive handler using dynamic memory allocation and callback chains. Reasonable in Linux userspace, but putting &lt;code&gt;malloc&lt;/code&gt; in a UART handler running in ISR context on an MCU with 256KB of RAM leads to crashes at unpredictable times. A static ring buffer is the right answer, but AI proposes the pattern it's seen most.&lt;/p&gt;

&lt;p&gt;Using AI in this environment requires a structure: provide sufficient context before generation, and verify the output after. That's the starting point of the 4-stage loop I've built.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 4-Stage Loop
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl974smlsn9114xt0wh1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl974smlsn9114xt0wh1.webp" alt="Four-stage firmware development loop with two human verification gates between Research→Plan and Plan→Execute" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI generates artifacts at each stage; humans verify at two gates.&lt;/p&gt;

&lt;p&gt;The critical elements are the &lt;strong&gt;two human gates&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate 1 (between Research and Plan):&lt;/strong&gt; I cross-check the AI's research output against original documentation. If a hallucination slips through here, the bad information propagates into the plan and then into the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate 2 (between Plan and Execute):&lt;/strong&gt; I review the code snippets and constraints in the AI-generated plan. Wrong init priority ordering or blocking API calls inside ISR handlers must be caught at this gate. Design errors that pass this point manifest as "build succeeds, flash succeeds, but crashes under specific conditions" — the worst debugging scenario.&lt;/p&gt;

&lt;p&gt;Compared to the web's "code → hot reload → verify" loop, this has more steps. But in firmware, the time cost of build → flash → hardware verification is so long that "discovering bad code late" is far more expensive than "planning carefully up front." The 4-stage loop reflects that cost structure.&lt;/p&gt;

&lt;p&gt;Every stage produces a &lt;code&gt;.md&lt;/code&gt; file. Not a chat response that vanishes when the session ends, but a document that persists in the file system. If the session disconnects or the context window resets, I reload the previous stage's artifact and pick up where I left off. This "persistent document chain" is the infrastructure that holds the loop together.&lt;/p&gt;
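
&lt;p&gt;One way such a document chain can look on disk (file names are illustrative, not a prescribed layout):&lt;br&gt;
&lt;/p&gt;

```
docs/twim-driver/
├── research.md    # Stage 1 output, cross-checked at Gate 1
├── plan.md        # Stage 2 checklist and snippets, reviewed at Gate 2
└── execution.md   # Stage 3 log: build errors, workarounds, decisions
```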




&lt;h2&gt;
  
  
  Research — What Happens When You Feed 200 Pages of Datasheet to AI
&lt;/h2&gt;

&lt;p&gt;This stage has the best time-to-value ratio of all four. When I need to work with a new peripheral, I ask AI for a structured summary instead of reading the datasheet cover to cover.&lt;/p&gt;

&lt;p&gt;A common mistake here: &lt;strong&gt;feeding the entire datasheet PDF to the AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MCU datasheets typically run 200–800 pages. The &lt;a href="https://www.nordicsemi.com/Products/nRF5340" rel="noopener noreferrer"&gt;nRF5340&lt;/a&gt; Product Specification alone is hundreds of pages. Dumping all of it into context burns a significant number of input tokens. The bigger problem: with hundreds of pages loaded at once, the AI loses focus on the relevant section and starts pulling patterns from unrelated information.&lt;/p&gt;

&lt;p&gt;My approach: &lt;strong&gt;feed it section by section.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If I need to implement an I2C driver, I extract just the I2C (TWI/TWIM) chapter from the datasheet. "Read only Section 6.13 TWIM from this PDF and organize the following items." I add the register map table and timing diagram pages if needed. This reduces token cost while narrowing the AI's focus, improving accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principles for AI Research Tasks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Explicitly ask for deep analysis.&lt;/strong&gt; Skip this and the AI returns a surface-level paraphrase of the first paragraph. The prompt structure I actually use looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Deep-dive into this MCU's TWIM (I2C Master) peripheral on the following points:
1. Init sequence — full order from clock enable to first transaction
2. Per-register bit field meanings — especially FREQUENCY, ADDRESS, ERRORSRC
3. Whether DMA setup is required or manual byte transfer is possible
4. Clock stretching support and timeout configuration
5. Any discrepancies between the official SDK [nrfx_twim](https://github.com/NordicSemiconductor/nrfx) driver and the datasheet
6. Trade-offs of each approach (DMA vs interrupt-driven, polling vs event-driven)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ask about trade-offs from the research stage.&lt;/strong&gt; DMA frees the CPU but consumes a DMA channel and adds configuration complexity. Interrupt-driven is simpler to implement but increases CPU load at high communication speeds. Gathering this decision material during research speeds up decision-making during the planning stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Save results to a &lt;code&gt;.md&lt;/code&gt; file.&lt;/strong&gt; I instruct: "Save the research results to research.md. Include code snippets (register setup examples, SDK API call patterns) for each item." Chat responses disappear when the session ends. A &lt;code&gt;.md&lt;/code&gt; file can be reloaded as context for the planning stage, and it's easy to cross-check against the original datasheet side by side.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gate 1: Human Cross-Verification
&lt;/h3&gt;

&lt;p&gt;The most important action at this stage: &lt;strong&gt;comparing the AI's summary against the original datasheet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the AI reports "the TWIM FREQUENCY register value 0x06400000 corresponds to 400kHz," I verify that value directly in the datasheet's register map table. In my experience, AI gets register addresses and bit field values wrong roughly 10–15% of the time. Most errors come from mixing data between similar chip variants. Skip this gate, and incorrect register values propagate through the plan into actual code, manifesting as I2C communication failures on the board. Tracking that down might require an oscilloscope.&lt;/p&gt;

&lt;p&gt;The research review takes me 15–30 minutes. Reading the datasheet from scratch without AI would take 2–3 hours. AI summary + cross-check in under 30 minutes. That time saving is the primary reason I use AI for research.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruz6df2mwsvbup5otfr5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruz6df2mwsvbup5otfr5.webp" alt="Datasheet feeding strategy comparison — full 700-page PDF input versus section-by-section approach showing accuracy and token cost tradeoffs" width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feeding the full datasheet raises token cost and lowers accuracy. Section-by-section is the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Planning — Agree on the Design Before Writing Code
&lt;/h2&gt;

&lt;p&gt;After research, the urge to start coding is strong. Resisting that urge is the second key to this workflow.&lt;/p&gt;

&lt;p&gt;In the planning stage, I ask the AI: "Using the research document as reference, plan which files to modify, in what order, using which APIs." I always specify a few things explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Checklist Format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Implementation Plan: TWIM I2C Driver&lt;/span&gt;

&lt;span class="gu"&gt;### Constraints&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Only &lt;span class="sb"&gt;`k_sem_give()`&lt;/span&gt; allowed in TWIM ISR, &lt;span class="sb"&gt;`k_malloc()`&lt;/span&gt; forbidden
&lt;span class="p"&gt;-&lt;/span&gt; init priority: TWIM at POST_KERNEL level, after device default priority (40)
&lt;span class="p"&gt;-&lt;/span&gt; I2C bus shared by 2 sensors → mutex required

&lt;span class="gu"&gt;### Implementation Items&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] 1. Enable TWIM node in Devicetree overlay
&lt;span class="p"&gt;-&lt;/span&gt; [ ] 2. Add &lt;span class="sb"&gt;`CONFIG_I2C`&lt;/span&gt;, &lt;span class="sb"&gt;`CONFIG_NRFX_TWIM0`&lt;/span&gt; to Kconfig
&lt;span class="p"&gt;-&lt;/span&gt; [ ] 3. Write i2c_wrapper.h — define init, read, write APIs
&lt;span class="p"&gt;-&lt;/span&gt; [ ] 4. Implement i2c_wrapper.c — nrfx_twim based, mutex-protected
&lt;span class="p"&gt;-&lt;/span&gt; [ ] 5. Switch sensor A driver to use i2c_wrapper calls
&lt;span class="p"&gt;-&lt;/span&gt; [ ] 6. Build verification and basic I2C scan test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checkboxes (&lt;code&gt;- [ ]&lt;/code&gt;) serve a purpose. During execution, I tell the AI "implement item 1 and mark the checkbox as [x]." When a session breaks or I resume the next day, opening this &lt;code&gt;.md&lt;/code&gt; file immediately shows what's done and where things stalled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Include Code Snippets
&lt;/h3&gt;

&lt;p&gt;"Add TWIM node to the &lt;a href="https://docs.zephyrproject.org/latest/build/dts/index.html" rel="noopener noreferrer"&gt;Devicetree&lt;/a&gt; overlay" alone is unreviewable. I have the AI write actual code snippets at the planning stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* Before: no TWIM node in app.overlay */

/* After */
&amp;amp;i2c0 {
    compatible = "nordic,nrf-twim";
    status = "okay";
    pinctrl-0 = &amp;lt;&amp;amp;i2c0_default&amp;gt;;
    pinctrl-1 = &amp;lt;&amp;amp;i2c0_sleep&amp;gt;;
    pinctrl-names = "default", "sleep";
    clock-frequency = &amp;lt;I2C_BITRATE_FAST&amp;gt;;  /* 400 kHz */
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets me check specifics during review: "Does the pinctrl name match the actual board DTS?", "Is &lt;code&gt;I2C_BITRATE_FAST&lt;/code&gt; supported on this chip?" You can't review an abstract plan. You can review a code snippet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Record Trade-offs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Trade-off Analysis&lt;/span&gt;
| Option | Pros | Cons |
|--------|------|------|
| nrfx_twim (HAL) | Direct control, minimal overhead | No Zephyr DTS integration |
| Zephyr i2c API | DTS auto-binding, portable | Abstraction layer overhead |
→ &lt;span class="gs"&gt;**Choice: Zephyr i2c API**&lt;/span&gt; — sensor drivers already use Zephyr APIs, so compatibility wins.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This record pays off when future-me asks "why did I do it this way?" A few weeks later, when a performance issue prompts considering a switch to nrfx_twim, the decision context is right there in the &lt;code&gt;.md&lt;/code&gt; file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gate 2: Human Design Review
&lt;/h3&gt;

&lt;p&gt;Three points I focus on during plan review:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.zephyrproject.org/latest/kernel/drivers/index.html" rel="noopener noreferrer"&gt;Init priority&lt;/a&gt; ordering:&lt;/strong&gt; Wrong driver init order in Zephyr causes null pointer dereferences at boot. AI frequently overlooks this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ISR context constraints:&lt;/strong&gt; AI often fails to distinguish APIs callable from interrupt handlers vs. thread context. &lt;a href="https://docs.zephyrproject.org/latest/kernel/services/synchronization/mutexes.html" rel="noopener noreferrer"&gt;&lt;code&gt;k_mutex_lock()&lt;/code&gt; cannot be used in ISR&lt;/a&gt; — catch it here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared resources:&lt;/strong&gt; Missing mutex protection on a shared I2C bus, incorrect SPI CS pin management.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Design errors that pass this gate produce "build success, flash success, but crash under specific conditions" — the worst scenario. Timing-dependent bugs are hard to reproduce and can eat half a day to track down. Thirty minutes of careful plan review saves four hours of debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  Execution — One Item at a Time, Check After Each
&lt;/h2&gt;

&lt;p&gt;Once the plan is approved, I have the AI write code. The principle is simple: &lt;strong&gt;execute one item, verify the build, mark it complete, then move to the next.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Me: "Implement item 1 from plan.md. Mark the checkbox [x] when done."
AI: (modifies Devicetree overlay, marks checkbox)
Me: west build → success confirmed
Me: "Implement item 2."
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applying multiple changes at once in firmware makes build errors extremely hard to trace. When Kconfig and source code changes land simultaneously, just separating "is this a config problem or a code problem?" wastes time. One-item-at-a-time execution narrows error causes to exactly one change.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Build Errors Occur
&lt;/h3&gt;

&lt;p&gt;Zephyr/west build errors are notoriously unfriendly. CMake configuration errors, Kconfig dependency conflicts, Devicetree binding mismatches, and linker errors pour out in dozens of log lines. This is where AI excels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Paste the full error log.&lt;/strong&gt; Not "I got a build error" — copy the entire terminal output. AI extracts the actual error line from verbose CMake traces and pinpoints causes like "this error is a dependency conflict because &lt;code&gt;CONFIG_I2C&lt;/code&gt; is enabled but &lt;code&gt;CONFIG_GPIO&lt;/code&gt; is missing." I use AI to identify the error category; I decide the actual fix based on the plan's context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Document Execution History
&lt;/h3&gt;

&lt;p&gt;I record build errors, workarounds, and unexpected behavior in a &lt;code&gt;.md&lt;/code&gt; file. Short entries like "item 3: &lt;code&gt;CONFIG_NRFX_TWIM0&lt;/code&gt; deprecated, used &lt;code&gt;CONFIG_I2C_NRFX_TWIM&lt;/code&gt; instead."&lt;/p&gt;

&lt;p&gt;This record pays off in two situations. First, when a similar project hits the same issue, I hand the past record to AI and it gets "we solved this before" context immediately. Second, when the context window resets after a long conversation, reloading the execution log &lt;code&gt;.md&lt;/code&gt; restores the current state.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing — Paste Logs, AI Debugs
&lt;/h2&gt;

&lt;p&gt;The reality of firmware testing: no matter how thorough the unit tests, on-board verification is the final check. All I2C driver unit tests can pass, but if clock stretching timeout hits during actual sensor communication, those unit tests mean nothing.&lt;/p&gt;

&lt;p&gt;When problems occur, my most-used pattern: &lt;strong&gt;paste the entire log into AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I copy runtime logs collected via UART or RTT and ask "analyze the root cause from this log." Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[00:00:01.234] &amp;lt;inf&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;twim: TWIM init OK, &lt;span class="nv"&gt;freq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;400kHz
&lt;span class="gp"&gt;[00:00:01.240] &amp;lt;inf&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;sensor_a: Starting I2C &lt;span class="nb"&gt;read&lt;/span&gt;, &lt;span class="nv"&gt;addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0x48
&lt;span class="gp"&gt;[00:00:01.245] &amp;lt;wrn&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;twim: TWIM event: &lt;span class="nv"&gt;ERROR_SRC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0x02 &lt;span class="o"&gt;(&lt;/span&gt;ANACK&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;[00:00:01.245] &amp;lt;err&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;sensor_a: I2C &lt;span class="nb"&gt;read &lt;/span&gt;failed: &lt;span class="nt"&gt;-5&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;EIO&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;[00:00:01.250] &amp;lt;inf&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;sensor_a: Retry 1/3
&lt;span class="gp"&gt;[00:00:01.255] &amp;lt;wrn&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;twim: TWIM event: &lt;span class="nv"&gt;ERROR_SRC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0x02 &lt;span class="o"&gt;(&lt;/span&gt;ANACK&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;[00:00:01.260] &amp;lt;inf&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;sensor_a: Retry 2/3
&lt;span class="gp"&gt;[00:00:01.265] &amp;lt;wrn&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;twim: TWIM event: &lt;span class="nv"&gt;ERROR_SRC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0x02 &lt;span class="o"&gt;(&lt;/span&gt;ANACK&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;[00:00:01.270] &amp;lt;err&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;sensor_a: All retries exhausted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI immediately responds: "ERROR_SRC=0x02 is Address NACK. Verify sensor address 0x48. If correct, suspect missing pull-up resistors or wiring issues." A human reading this log reaches the same conclusion, but looking up whether bit 1 of the ERROR_SRC register is ANACK in the datasheet takes 5 minutes. AI does it in 1 second.&lt;/p&gt;
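&lt;p&gt;That one-second lookup can also be frozen into a tiny helper once the register is understood. A sketch, with bit meanings taken from the nRF52 TWIM &lt;code&gt;ERRORSRC&lt;/code&gt; register (OVERRUN = bit 0, ANACK = bit 1, DNACK = bit 2; verify against your SoC's product specification):&lt;/p&gt;

```shell
# Decode nRF52 TWIM ERRORSRC values into readable flag names.
# Bit layout assumed from the nRF52 product spec: 0x1 OVERRUN,
# 0x2 ANACK (address NACK), 0x4 DNACK (data NACK).
twim_errorsrc() {
  case "$1" in
    0x01) echo "OVERRUN" ;;
    0x02) echo "ANACK (address not acknowledged - check slave address/pull-ups/wiring)" ;;
    0x04) echo "DNACK (data not acknowledged)" ;;
    *)    echo "unknown/combined ERRORSRC: $1" ;;
  esac
}
```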

&lt;p&gt;&lt;a href="https://www.segger.com/products/debug-probes/j-link/technology/about-real-time-transfer/" rel="noopener noreferrer"&gt;RTT (Real-Time Transfer)&lt;/a&gt; logs pair even better with AI than UART. RTT writes directly to a ring buffer in RAM without using any MCU peripheral, so CPU overhead is nearly zero — you can log even in timing-critical sections. Feed AI the ISR timing logs, DMA completion callback ordering, and thread context switch timestamps, and it finds patterns a human would struggle to spot in hundreds of lines: "Interrupts A and B fire in succession with only 8μs between them at this point."&lt;/p&gt;

&lt;p&gt;This is why I consider the testing stage the highest-leverage point for AI in this workflow. During research and planning, AI output requires human verification. But in log analysis, AI is faster than a human, and the margin for error is smaller. Logs are facts, and AI extracts patterns from facts. There's less room for hallucination.&lt;/p&gt;

&lt;p&gt;Limits exist, of course. AI can say "check the pull-up resistors," but picking up a multimeter and measuring resistance is a human job. Capturing SDA/SCL waveforms with a logic analyzer to confirm clock stretching is happening — also human. AI sets the debugging direction, but it cannot replace physical hardware verification.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr9qmdrx0196jzymf22n.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr9qmdrx0196jzymf22n.webp" alt="AI effectiveness spectrum across the 4-stage loop — lowest in code generation, highest in log analysis where inputs are factual data" width="800" height="993"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of all four stages, AI delivers the most value during testing. The input is factual data (logs).&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed and What Didn't
&lt;/h2&gt;

&lt;p&gt;I've used this workflow for over a month. Here's what shifted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What changed:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My role moved from "person who writes code" to "person who makes decisions and verifies." Time spent typing code shrank. Time spent cross-checking AI output against datasheets and reviewing constraint sections of implementation plans grew.&lt;/p&gt;

&lt;p&gt;Research time dropped by more than half. When working with a new peripheral, I ask AI for a structured summary and cross-check only the critical parts against the original — much faster than reading the datasheet from page one.&lt;/p&gt;

&lt;p&gt;Debugging patterns changed too. I used to read error logs and mentally cycle through possible causes one by one. Now I paste logs into AI, ask for "top 3 probable causes ranked by likelihood," and start verifying from the most likely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What didn't change:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Physical hardware testing remains beyond AI's reach. Verifying waveforms on an oscilloscope, measuring current draw, testing under various temperature conditions — still a human job.&lt;/p&gt;

&lt;p&gt;I treat AI-generated code more conservatively for security-related work. Encryption key management, secure boot chains, OTA signature verification — a single mistake in these areas can compromise the entire product's security. I use AI for research only in this domain; code generation stays manual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I want to try next:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm considering connecting Hardware-in-the-Loop (HIL) testing to the CI pipeline. Attach physical boards to a CI server, automatically build → flash → run basic communication tests on AI-generated code. This would tighten the feedback loop after Gate 2. Still in the infrastructure setup phase, but once this loop is automated, AI utility in firmware development takes another step up.&lt;/p&gt;

&lt;p&gt;AI doesn't replace firmware engineers. It helps firmware engineers make better decisions. But getting that help right requires structurally designing "where AI contributes and where humans intervene." The research → plan → execute → test loop is the current version of that design I've found. I plan to keep refining it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>firmware</category>
      <category>zephyr</category>
    </item>
    <item>
      <title>Embedded Firmware Development with Claude Code — Devicetree, Kconfig, and Debugging</title>
      <dc:creator>Errata Hunter</dc:creator>
      <pubDate>Wed, 25 Mar 2026 21:34:06 +0000</pubDate>
      <link>https://dev.to/erratahunter/embedded-firmware-development-with-claude-code-devicetree-kconfig-and-debugging-4jl7</link>
      <guid>https://dev.to/erratahunter/embedded-firmware-development-with-claude-code-devicetree-kconfig-and-debugging-4jl7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;75% of Zephyr firmware development time goes to Kconfig/devicetree configuration and build error interpretation, not writing code — Claude Code can intervene in that 75% directly from the terminal.&lt;/li&gt;
&lt;li&gt;Kconfig hallucination (AI inventing nonexistent symbols) is structurally preventable by combining a shell script that greps &lt;code&gt;build/zephyr/.config&lt;/code&gt; and Kconfig source files with Claude Code's skill and hook system.&lt;/li&gt;
&lt;li&gt;Feeding all three files (&lt;code&gt;.dts&lt;/code&gt;, &lt;code&gt;.dtsi&lt;/code&gt;, binding YAML) as context via the &lt;code&gt;@&lt;/code&gt; syntax prevents &lt;code&gt;compatible&lt;/code&gt; string hallucination in devicetree overlays.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI coding tools still feel distant for embedded firmware developers. "Can AI even understand devicetree?" "What if it invents Kconfig symbols?" — reasonable doubts. I had them too.&lt;/p&gt;

&lt;p&gt;I deployed &lt;a href="https://code.claude.com/" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; on production &lt;a href="https://docs.zephyrproject.org/latest/" rel="noopener noreferrer"&gt;Zephyr&lt;/a&gt;/NCS firmware projects at my day job. Kconfig hallucination broke builds. I built safeguards using Claude Code's skill and hook system to fix it. This post covers the methodology I developed. I can't share proprietary code, but I can explain where things break when you hand devicetree and Kconfig to AI, and how to structurally prevent it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Left the GUI IDE for the Terminal
&lt;/h2&gt;

&lt;p&gt;A typical day for a Zephyr firmware developer breaks down like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Run &lt;code&gt;west build&lt;/code&gt; → interpret CMake/Kconfig/devicetree errors&lt;/li&gt;
&lt;li&gt;  Edit &lt;code&gt;prj.conf&lt;/code&gt; → dig through &lt;code&gt;menuconfig&lt;/code&gt; to find the right &lt;code&gt;CONFIG_*&lt;/code&gt; symbols&lt;/li&gt;
&lt;li&gt;  Write devicetree overlays → cross-reference &lt;code&gt;.dts&lt;/code&gt;, &lt;code&gt;.dtsi&lt;/code&gt;, and binding YAML simultaneously&lt;/li&gt;
&lt;li&gt;  Flash and check serial logs → track down bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Actual C code writing accounts for roughly 25% of the day. The remaining 75% is spent fighting configuration files and error logs.&lt;/p&gt;

&lt;p&gt;GUI IDE AI features focus on code autocompletion — predicting function signatures, suggesting next lines. They help with the 25%. Claude Code runs in the terminal, so it can execute &lt;code&gt;west build&lt;/code&gt; directly, read the entire build log, interpret errors, and suggest next actions. It greps &lt;code&gt;.config&lt;/code&gt; files and traces Kconfig dependencies back to source. &lt;strong&gt;It can intervene in the other 75%&lt;/strong&gt; — and that's why a terminal-based AI agent has an edge in embedded.&lt;/p&gt;

&lt;p&gt;Dedicated embedded AI tools are emerging. &lt;a href="https://embedder.com/" rel="noopener noreferrer"&gt;Embedder&lt;/a&gt; (YC S25) generates driver code from uploaded PDF datasheets and is preparing serial console and GDB integration. If you work within supported chipsets (STM32, ESP32) using standard workflows, these packaged tools deliver real productivity gains.&lt;/p&gt;

&lt;p&gt;I chose Claude Code instead. The reason comes down to what software engineering calls &lt;strong&gt;the double-edged sword of opinionated design&lt;/strong&gt;. Packaged tools are productive within the Golden Path their creators designed, but friction increases the moment you step outside it. When the tool's assumed build pipeline doesn't match your project structure, you end up contorting your workflow to fit the tool. When the tool's baked-in "best practices" conflict with your hardware's nonstandard constraints, AI suggestions derail rather than help. Embedded development — where hardware configuration, SDK structure, and build pipelines vary project to project — makes this especially pronounced.&lt;/p&gt;

&lt;p&gt;It's a tradeoff between convenience and control. I deal with Zephyr/NCS &lt;a href="https://reversetobuild.com/ncs-freestanding-t2-t3-guide/" rel="noopener noreferrer"&gt;west workspace structures&lt;/a&gt; and per-project build configurations, so I chose to tune a general-purpose tool to fit my pipeline directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Devicetree: Teaching Hardware to AI
&lt;/h2&gt;

&lt;p&gt;Zephyr's &lt;a href="https://docs.zephyrproject.org/latest/build/dts/index.html" rel="noopener noreferrer"&gt;devicetree system&lt;/a&gt; has three layers. SoC &lt;code&gt;.dtsi&lt;/code&gt; files define base hardware. Board &lt;code&gt;.dts&lt;/code&gt; files declare pin mappings. &lt;a href="https://docs.zephyrproject.org/latest/build/dts/bindings.html" rel="noopener noreferrer"&gt;Binding YAML&lt;/a&gt; files specify valid properties for each node. Developers must cross-reference all three layers when writing overlays.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19h0jsf0tjzx6l6mvduw.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19h0jsf0tjzx6l6mvduw.webp" alt="Zephyr devicetree three-layer system — SoC dtsi, board dts, and binding YAML flow through overlay to final generated header" width="800" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get accurate overlays from AI, you need to feed all three files as context.&lt;/p&gt;

&lt;h3&gt;
  
  
  No Context Means Hallucination
&lt;/h3&gt;

&lt;p&gt;Tell Claude Code "add SPI flash to this board" without context and you'll get a plausible but wrong overlay. The &lt;code&gt;compatible&lt;/code&gt; string won't match any actual binding, or it'll reference a nonexistent node label.&lt;/p&gt;

&lt;p&gt;For accurate results, feed all three files explicitly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Read @boards/arm/nrf52840dk_nrf52840.dts and
@zephyr/dts/arm/nordic/nrf52840.dtsi, then reference
@zephyr/dts/bindings/spi/spi-device.yaml to write
an overlay adding W25Q128 flash to SPI1."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the &lt;code&gt;@&lt;/code&gt; syntax pointing to specific files, Claude Code correctly references existing SPI node labels from the board and doesn't miss required properties (&lt;code&gt;reg&lt;/code&gt;, &lt;code&gt;spi-max-frequency&lt;/code&gt;) defined in the binding YAML.&lt;/p&gt;
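&lt;p&gt;For reference, the overlay that prompt should yield looks roughly like this. &lt;code&gt;jedec,spi-nor&lt;/code&gt; is Zephyr's standard binding for SPI NOR flash, and the JEDEC ID and size shown are the W25Q128's published values, but verify every property against the binding YAML and the part's datasheet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;spi1 {
    status = "okay";

    w25q128: w25q128@0 {
        compatible = "jedec,spi-nor";
        reg = &amp;lt;0&amp;gt;;
        spi-max-frequency = &amp;lt;8000000&amp;gt;;
        jedec-id = [ef 40 18];
        size = &amp;lt;0x8000000&amp;gt;; /* in bits (128 Mbit) */
    };
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;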

&lt;h3&gt;
  
  
  Three Patterns Where AI Gets Devicetree Wrong
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;&lt;code&gt;compatible&lt;/code&gt; string hallucination&lt;/strong&gt; — invents a compatible that doesn't exist in any binding. Feeding the binding YAML as context prevents this.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Node address collision&lt;/strong&gt; — &lt;code&gt;@0&lt;/code&gt; must match &lt;code&gt;reg = &amp;lt;0&amp;gt;&lt;/code&gt;, but AI assigns addresses that duplicate existing nodes. Feeding the board &lt;code&gt;.dts&lt;/code&gt; lets it check existing assignments.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Ignoring overlay detection rules&lt;/strong&gt; — Zephyr's build system auto-detects overlays in this order when &lt;code&gt;DTC_OVERLAY_FILE&lt;/code&gt; is unset: &lt;code&gt;socs/&amp;lt;SOC&amp;gt;.overlay&lt;/code&gt; → &lt;code&gt;boards/&amp;lt;BOARD&amp;gt;.overlay&lt;/code&gt; → &lt;code&gt;&amp;lt;BOARD&amp;gt;.overlay&lt;/code&gt; → &lt;code&gt;app.overlay&lt;/code&gt;. If AI creates an overlay with an arbitrary filename, the build system ignores it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The third problem is solved by adding one line to CLAUDE.md:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Devicetree
- Overlay files must be named `app.overlay` or `boards/&amp;lt;BOARD&amp;gt;.overlay`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;AI's role in devicetree work isn't "writing overlays for you." It's &lt;strong&gt;reducing the overhead of cross-referencing three files&lt;/strong&gt; — checking available nodes in &lt;code&gt;.dtsi&lt;/code&gt;, pulling required properties from binding YAML, and combining settings that don't conflict with existing &lt;code&gt;.dts&lt;/code&gt;. The question is whether you do this manually by switching between three files, or hand all three to AI and get it in one pass.&lt;/p&gt;




&lt;h2&gt;
  
  
  Kconfig: Building a Skill to Prevent AI Hallucination
&lt;/h2&gt;

&lt;p&gt;Zephyr's &lt;a href="https://docs.zephyrproject.org/latest/build/kconfig/index.html" rel="noopener noreferrer"&gt;Kconfig system&lt;/a&gt; has thousands of symbols in a tree structure. Enabling &lt;code&gt;CONFIG_BT&lt;/code&gt; auto-activates &lt;code&gt;NET_BUF&lt;/code&gt;. Choosing &lt;code&gt;CONFIG_BT_HCI&lt;/code&gt; under &lt;code&gt;BT_STACK_SELECTION&lt;/code&gt; triggers another dependency chain. Symbols forced on by &lt;code&gt;select&lt;/code&gt;, conditionally enabled by &lt;code&gt;depends on&lt;/code&gt;, and suggested by &lt;code&gt;imply&lt;/code&gt; are intertwined. The Zephyr project itself &lt;a href="https://github.com/zephyrproject-rtos/zephyr/issues/52575" rel="noopener noreferrer"&gt;acknowledges excessive &lt;code&gt;select&lt;/code&gt; usage&lt;/a&gt; and is migrating to &lt;code&gt;depends on&lt;/code&gt; — that's how complex the dependency structure is.&lt;/p&gt;

&lt;p&gt;Ask AI to "create a minimal prj.conf for BLE Central scan only" and you get a plausible result. But it might invent nonexistent symbols or miss required dependencies.&lt;/p&gt;

&lt;p&gt;When I hit this problem at work, I solved it by making incremental requests to Claude Code. I didn't start with "build me a Kconfig hallucination prevention skill." I asked questions one at a time, and automation emerged naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Conversation Flow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;"Where does the final Kconfig output go after &lt;code&gt;west build&lt;/code&gt;?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;build/zephyr/.config&lt;/code&gt; contains the fully resolved symbol list — thousands of &lt;code&gt;CONFIG_*=y/n&lt;/code&gt; lines. You can also check via &lt;code&gt;menuconfig&lt;/code&gt;/&lt;code&gt;guiconfig&lt;/code&gt;, but this file is the ground truth the build system actually uses.&lt;/p&gt;
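&lt;p&gt;Querying that ground truth is a one-line grep. A sketch (the helper name is mine; disabled symbols appear as comment lines, so both forms are matched):&lt;/p&gt;

```shell
# Print the resolved state of a Kconfig symbol from a built .config.
# Disabled symbols appear as "# CONFIG_X is not set", so match both forms.
kconf_state() {
  # $1 = path to build/zephyr/.config, $2 = symbol name without CONFIG_
  grep -E "^(CONFIG_$2=|# CONFIG_$2 is not set)" "$1"
}
```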

&lt;p&gt;&lt;strong&gt;"Find the .config file in this project's build output."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Code locates &lt;code&gt;build/zephyr/.config&lt;/code&gt; and shows its contents. A follow-up question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Does this project have a separate kernel source? Is there a separate kernel .config like in Linux?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unlike the Linux kernel, Zephyr builds the app and RTOS kernel into a single binary. &lt;code&gt;build/zephyr/.config&lt;/code&gt; is the entire system's configuration. There's no separate kernel .config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Find all the Kconfig source files in the west workspace."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;zephyr/Kconfig&lt;/code&gt;, &lt;code&gt;zephyr/subsys/bluetooth/Kconfig&lt;/code&gt;, Kconfig files in each driver directory — organized into a tree. These source files contain each symbol's &lt;code&gt;depends on&lt;/code&gt;, &lt;code&gt;select&lt;/code&gt;, and &lt;code&gt;help&lt;/code&gt; text. If AI references these when modifying &lt;code&gt;prj.conf&lt;/code&gt;, it can't fabricate nonexistent symbols.&lt;/p&gt;

&lt;p&gt;Here's where a problem arises. The &lt;code&gt;.config&lt;/code&gt; file is thousands of lines. Kconfig source files are scattered across the entire west workspace. Reading everything every time wastes tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I want to reference these files when modifying .conf, but minimize token usage. How can I query only the relevant symbols?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Code proposes a &lt;strong&gt;shell script + skill combination&lt;/strong&gt;. A script that greps for relevant symbols, called by a skill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Build that skill."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Kconfig reference skill appears under &lt;code&gt;.claude/skills/&lt;/code&gt;. When a &lt;code&gt;.conf&lt;/code&gt; modification is requested:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Grep the built &lt;code&gt;.config&lt;/code&gt; for the current state of relevant symbols&lt;/li&gt;
&lt;li&gt; Extract the &lt;code&gt;depends on&lt;/code&gt;, &lt;code&gt;select&lt;/code&gt;, &lt;code&gt;help&lt;/code&gt; text from Kconfig source&lt;/li&gt;
&lt;li&gt; Use this as context when modifying &lt;code&gt;.conf&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead of reading thousands of lines, it extracts only the needed symbols and their dependencies.&lt;/p&gt;
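&lt;p&gt;The article doesn't show the skill's actual script, but step 2 can be sketched as a single GNU grep over the workspace (the helper name and the context window size are my choices):&lt;/p&gt;

```shell
# Pull a symbol's definition block (depends on / select / help) out of the
# Kconfig tree without reading whole files. GNU grep; -A 8 is an arbitrary
# context window that covers typical definitions.
kconf_def() {
  # $1 = west workspace root, $2 = symbol name without CONFIG_
  grep -r -n -A 8 --include="Kconfig*" "^config $2\$" "$1"
}
```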

&lt;p&gt;&lt;strong&gt;"I want this skill to trigger only when modifying .conf files. Check what needs to change in .claude/."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Code's &lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;hook system&lt;/a&gt; handles this. Set &lt;code&gt;PostToolUse&lt;/code&gt; with &lt;code&gt;matcher: "Write|Edit"&lt;/code&gt; in &lt;code&gt;.claude/settings.json&lt;/code&gt;, extract the file path from stdin JSON, and conditionally trigger only for &lt;code&gt;.conf&lt;/code&gt; files:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FILE_PATH=$(jq -r '.tool_input.file_path' &amp;lt; /dev/stdin)
if [[ ! "$FILE_PATH" =~ \.conf$ ]]; then
  exit 0  # ignore non-.conf files
fi
## Run Kconfig reference logic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;CLAUDE.md instructions are advisory — AI can ignore them. Hooks are deterministic. They execute without exception. Making Kconfig source reference automatic on every &lt;code&gt;.conf&lt;/code&gt; edit structurally prevents AI from inventing nonexistent symbols.&lt;/p&gt;
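&lt;p&gt;Concretely, the hook registration in &lt;code&gt;.claude/settings.json&lt;/code&gt; can look like this (field names per the hook documentation linked above; the script path is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/kconfig-ref.sh" }
        ]
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;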

&lt;h3&gt;
  
  
  What This Flow Reveals
&lt;/h3&gt;

&lt;p&gt;Seven short exchanges produced a hallucination prevention skill. Asking "build me a Kconfig hallucination prevention skill" upfront wouldn't have worked — you can't design the solution without knowing &lt;code&gt;.config&lt;/code&gt; exists.&lt;/p&gt;

&lt;p&gt;The pattern here is &lt;strong&gt;discovering AI's limitation, then using the same AI to build a tool that fills the gap&lt;/strong&gt;. Don't try to turn Claude Code into an embedded engineer. Instead, as the engineer, identify AI's weak spots and co-build compensating tools. That approach works.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build Error Debugging — Where AI Delivers the Highest ROI
&lt;/h2&gt;

&lt;p&gt;When &lt;code&gt;west build&lt;/code&gt; fails, the terminal floods with errors from CMake, Kconfig, devicetree, the C compiler, and the linker all mixed together. Even experienced engineers spend time just figuring out whether an error is a devicetree problem or a Kconfig problem.&lt;/p&gt;

&lt;p&gt;Feed the entire build log to Claude Code and it classifies errors by category, then traces root causes. Embedded build errors fall into three categories, each with different AI utility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Devicetree Binding Mismatch
&lt;/h3&gt;

&lt;p&gt;When a &lt;code&gt;compatible&lt;/code&gt; string doesn't match any binding YAML, the build system throws an error. The error message is clear enough to solve without AI, but Claude Code finds the correct binding YAML across hundreds of directories under &lt;code&gt;zephyr/dts/bindings/&lt;/code&gt; and proposes a fix in one step. Manually searching takes time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kconfig Dependency Failure — "Silent Failure"
&lt;/h3&gt;

&lt;p&gt;This one is trickier. Set &lt;code&gt;CONFIG_X=y&lt;/code&gt; in &lt;code&gt;prj.conf&lt;/code&gt;, but if &lt;code&gt;CONFIG_X&lt;/code&gt; has &lt;code&gt;depends on CONFIG_Y&lt;/code&gt; and &lt;code&gt;CONFIG_Y=n&lt;/code&gt;, the Kconfig system silently ignores &lt;code&gt;CONFIG_X&lt;/code&gt;. The build succeeds, but the intended feature doesn't work.&lt;/p&gt;

&lt;p&gt;Have Claude Code compare &lt;code&gt;prj.conf&lt;/code&gt; against &lt;code&gt;build/zephyr/.config&lt;/code&gt; and it finds symbols present in &lt;code&gt;prj.conf&lt;/code&gt; but missing from &lt;code&gt;.config&lt;/code&gt;. Tracing the unmet dependency requires Kconfig source — and the Kconfig reference skill from earlier connects here too.&lt;/p&gt;
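&lt;p&gt;The comparison itself is mechanical enough to script as a first pass before involving the AI. A sketch (helper name is mine; exact-line matching is a simplification, since the resolver may normalize values):&lt;/p&gt;

```shell
# List symbols requested in prj.conf that are absent from the resolved
# .config, i.e. silently dropped by an unmet "depends on".
silent_drops() {
  # $1 = prj.conf, $2 = build/zephyr/.config
  grep -E '^CONFIG_[A-Za-z0-9_]+=' "$1" | while read -r want; do
    grep -q -F -x "$want" "$2" || echo "DROPPED: $want"
  done
}
```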

&lt;h3&gt;
  
  
  Linker Errors
&lt;/h3&gt;

&lt;p&gt;Embedded linker errors are typically RAM/Flash overflow or duplicate symbol definitions. Claude Code reads the linker script (&lt;code&gt;.ld&lt;/code&gt;) and &lt;code&gt;build.map&lt;/code&gt; to identify which object files conflict. For memory overflow, it extracts each section's size from &lt;code&gt;build.map&lt;/code&gt; and suggests reduction priorities.&lt;/p&gt;

&lt;p&gt;Build error debugging is where Claude Code's ROI is highest. Error messages are text, root-cause tracing requires cross-referencing multiple files, and resolution patterns are relatively well-defined.&lt;/p&gt;




&lt;h2&gt;
  
  
  Runtime Debugging: Log Analysis and GDB
&lt;/h2&gt;

&lt;p&gt;After a successful build and flash, a different kind of debugging begins.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serial Log Pattern Analysis
&lt;/h3&gt;

&lt;p&gt;Capture &lt;code&gt;printk&lt;/code&gt; or Zephyr &lt;code&gt;LOG_MODULE&lt;/code&gt; output over serial and feed it to Claude Code. It identifies timestamp intervals, repeating error code patterns, and state changes preceding specific events. Faster than scrolling through hundreds of lines manually.&lt;/p&gt;
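&lt;p&gt;Pre-computing the timestamp deltas before pasting makes stalls jump out. A sketch assuming Zephyr's default &lt;code&gt;[hh:mm:ss.mmm]&lt;/code&gt; log prefix:&lt;/p&gt;

```shell
# Print the millisecond gap between consecutive log lines so that
# unusually long stalls stand out before handing the log to the AI.
# Assumes Zephyr's default [hh:mm:ss.mmm] timestamp prefix.
log_gaps() {
  awk -F'[][]' '{
    split($2, t, "[:.]")
    ms = ((t[1] * 60 + t[2]) * 60 + t[3]) * 1000 + t[4]
    if (NR > 1) printf "%+5d ms | %s\n", ms - prev, $0
    prev = ms
  }' "$1"
}
```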

&lt;p&gt;Automating the copy-paste step is possible too. &lt;a href="https://lib.rs/crates/serial-mcp-server" rel="noopener noreferrer"&gt;&lt;code&gt;serial-mcp-server&lt;/code&gt;&lt;/a&gt; is a Rust-based &lt;a href="https://code.claude.com/docs/en/mcp" rel="noopener noreferrer"&gt;MCP&lt;/a&gt; server that exposes UART communication as Claude Code tools — &lt;code&gt;list_ports&lt;/code&gt;, &lt;code&gt;open&lt;/code&gt;, &lt;code&gt;read&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt;, &lt;code&gt;close&lt;/code&gt;. It supports STM32, ESP32, and USB-serial converters like CH340 and FTDI. With MCP configured, you can say "open the serial port and read the log" mid-conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  HardFault Analysis
&lt;/h3&gt;

&lt;p&gt;When a HardFault occurs during J-Link + GDB debugging, feed the call stack and register dump to Claude Code. It interprets Cortex-M &lt;a href="https://developer.arm.com/documentation/dui0552/latest/cortex-m3-peripherals/system-control-block/configurable-fault-status-register" rel="noopener noreferrer"&gt;CFSR (Configurable Fault Status Register)&lt;/a&gt; bits and traces the faulting function to build a cause hypothesis.&lt;/p&gt;

&lt;p&gt;Stack overflows, null pointer dereferences, and unaligned memory access — common patterns — are caught accurately. Nondeterministic bugs like timing issues between DMA completion interrupts and the main loop are harder, since logs alone can't reproduce them. AI help has limits there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations and Workarounds
&lt;/h2&gt;

&lt;p&gt;Areas where Claude Code struggles in embedded:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary protocol parsing&lt;/strong&gt; — for byte-packed data like BLE custom profiles or proprietary sensor protocols, AI makes frequent errors in bit shifting and endianness handling. Packed struct field offset calculations vary by compiler and target architecture, and AI overlooks these differences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timing-critical interrupt logic&lt;/strong&gt; — when ISR execution time is constrained to microseconds, AI generates functionally correct code but doesn't optimize for execution time. &lt;code&gt;volatile&lt;/code&gt; access ordering, cache line alignment, and compiler barrier insertion remain the engineer's domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware register maps&lt;/strong&gt; — don't expect AI to know your SoC's register map accurately. It gets the general structure right, but hallucinates on specific bit reset values and reserved bit handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mitigation Strategies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Keep hardware specs in CLAUDE.md, but keep it short.&lt;/strong&gt; &lt;a href="https://code.claude.com/docs/en/best-practices" rel="noopener noreferrer"&gt;Claude Code's official best practices&lt;/a&gt; call this "Progressive Disclosure" — don't dump all information, tell AI how to find it. Instead of pasting the entire register map, write "refer to &lt;code&gt;nrf52840.svd&lt;/code&gt; for nRF52840 register details." Long CLAUDE.md files cause AI to ignore instructions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Hardware
- SoC: [nRF52840](https://www.nordicsemi.com/Products/nRF52840) (Cortex-M4F, 256KB RAM, 1MB Flash)
- Board: nRF52840-DK (PCA10056)
- NCS SDK: v2.9.0
- Register reference: nrf52840.svd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Always provide verification.&lt;/strong&gt; Have AI write code, then run &lt;code&gt;west build&lt;/code&gt; and check the result — it catches its own mistakes. According to Claude Code's official docs, "run tests after every change — this alone increases output quality 2–3x." Unit testing is often impractical in embedded, but &lt;code&gt;west build&lt;/code&gt; pass/fail serves as a first-order verification.&lt;/p&gt;
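&lt;p&gt;That verification loop can be wrapped so the result comes back in a paste-ready form. A sketch (&lt;code&gt;west&lt;/code&gt; is assumed on PATH; the wrapper and its output format are mine, not from the docs):&lt;/p&gt;

```shell
# Rebuild and report pass/fail plus the first error lines, in a form that
# can be fed straight back to the agent. $1 = board target.
verify_build() {
  if west build -b "$1" 1> build.out 2> build.err; then
    echo "BUILD OK"
  else
    echo "BUILD FAILED"
    grep -i "error" build.err | head -10
  fi
}
```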




&lt;h2&gt;
  
  
  Claude Code Configuration Strategy for Embedded Projects
&lt;/h2&gt;

&lt;p&gt;Effective Claude Code configuration for embedded requires separating the roles of CLAUDE.md, skills, and hooks.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Execution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CLAUDE.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-session context (board, SDK, build commands)&lt;/td&gt;
&lt;td&gt;Auto-loaded (advisory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Skills&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Domain knowledge, workflows (Kconfig reference, etc.)&lt;/td&gt;
&lt;td&gt;On-demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Non-negotiable rules (.conf validation, etc.)&lt;/td&gt;
&lt;td&gt;Auto-executed (deterministic)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;h3&gt;
  
  
  CLAUDE.md: Minimal Context Only
&lt;/h3&gt;

&lt;p&gt;CLAUDE.md loads every session. Exclude what AI can infer from code; include only what it can't. Board pin maps, SoC specs, NCS SDK version, and build commands are typical entries.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Build

&lt;ul&gt;
&lt;li&gt;west build -b nrf52840dk/nrf52840 -- -DOVERLAY_CONFIG=overlay-debug.conf&lt;/li&gt;
&lt;li&gt;west flash --runner jlink&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Conventions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Overlay filenames: app.overlay or boards/&amp;lt;BOARD&amp;gt;.overlay&lt;/li&gt;
&lt;li&gt;Kconfig: prj.conf (shared), boards/&amp;lt;BOARD&amp;gt;.conf (board-specific)&lt;/li&gt;
&lt;li&gt;After .conf edits, verify against build/zephyr/.config
&lt;/li&gt;
&lt;/ul&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;


Skills: Isolate Domain Knowledge
&lt;/h3&gt;


&lt;p&gt;Knowledge needed only in specific workflows — like the Kconfig reference skill — belongs in &lt;a href="https://code.claude.com/docs/en/skills" rel="noopener noreferrer"&gt;&lt;code&gt;.claude/skills/&lt;/code&gt;&lt;/a&gt;. Putting everything in CLAUDE.md wastes the context window, and as it grows longer, AI starts ignoring instructions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hooks: Advisory vs. Deterministic
&lt;/h3&gt;

&lt;p&gt;Writing "always reference &lt;code&gt;.config&lt;/code&gt; when editing &lt;code&gt;.conf&lt;/code&gt;" in CLAUDE.md doesn't guarantee compliance — AI can skip it. If a rule must execute without exception, implement it as a hook. Hooks run shell scripts before or after Claude Code's tool usage, independent of AI judgment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build-Debug Cycle
&lt;/h3&gt;

&lt;p&gt;When these components combine, the embedded build-debug cycle looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqu9qaess6w6hsd2qfb7t.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqu9qaess6w6hsd2qfb7t.webp" alt="Zephyr firmware build-debug cycle — three feedback loops (west build error loop, serial log runtime loop, Kconfig hook validation) converging on a single edit point, with Claude Code intervening at error classification and pattern analysis stages" width="800" height="1193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three feedback loops converge on a single edit point — Claude Code intervenes at the red and navy nodes&lt;/p&gt;

&lt;p&gt;Hooks auto-validate Kconfig on &lt;code&gt;.conf&lt;/code&gt; edits. AI classifies and traces build errors. AI analyzes serial log patterns. The 75% outside of code writing — that's where AI intervenes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Applying AI to embedded development isn't about how well AI writes code. It's about reducing friction in the other 75% — build configuration, error interpretation, log analysis.&lt;/p&gt;

&lt;p&gt;Claude Code can intervene in that 75%. But embedded's domain specifics (Kconfig hallucination, inaccurate hardware register maps) mean you can't use it out of the box. You need skills and hooks as compensating mechanisms. Having built and deployed these at my day job, I can say the loop of using AI to compensate for AI's own limitations works.&lt;/p&gt;

&lt;p&gt;I wrote this post methodology-first because I can't share proprietary code. When I start a personal side project, I plan to apply the same approach and share devicetree overlay sessions, the Kconfig skill in action, and build error debugging logs — with actual code.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>firmware</category>
      <category>zephyr</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>NCS Project Management Guide: Ditching Global Install to Reclaim Control</title>
      <dc:creator>Errata Hunter</dc:creator>
      <pubDate>Sun, 15 Mar 2026 21:37:25 +0000</pubDate>
      <link>https://dev.to/erratahunter/ncs-project-management-guide-ditching-global-install-to-reclaim-control-j39</link>
      <guid>https://dev.to/erratahunter/ncs-project-management-guide-ditching-global-install-to-reclaim-control-j39</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Patching SDK internals in an NCS global install silently breaks every other project on your machine.&lt;/li&gt;
&lt;li&gt;Freestanding (manual &lt;code&gt;ZEPHYR_BASE&lt;/code&gt; binding) → T2 (app owns &lt;code&gt;west.yml&lt;/code&gt; manifest) → T3 (separate manifest repo managing multiple apps and SDK) gives you progressive isolation.&lt;/li&gt;
&lt;li&gt;Start with Freestanding. Move to T2 or T3 only when reproducibility or multi-product needs demand it.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;When you first pick up the &lt;a href="https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/nrf/index.html" rel="noopener noreferrer"&gt;nRF Connect SDK (NCS)&lt;/a&gt;, the natural move is to follow Nordic's Toolchain Manager or VS Code Extension wizard. &lt;a href="https://reversetobuild.com/bare-metal-zephyr-antigravity-setup/" rel="noopener noreferrer"&gt;When I set up my Zephyr dev environment with Antigravity IDE&lt;/a&gt;, I did exactly that. A few button clicks and the &lt;a href="https://zephyrproject.org/" rel="noopener noreferrer"&gt;Zephyr RTOS&lt;/a&gt; core, libraries, and compiler land neatly under &lt;code&gt;C:\ncs\toolchains&lt;/code&gt; or a &lt;code&gt;v2.x.x&lt;/code&gt; folder.&lt;/p&gt;

&lt;p&gt;The problem: &lt;strong&gt;this convenient global install becomes an unmanageable swamp as projects multiply.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zephyr's ecosystem is well designed. Need to remap board pins? Use an &lt;code&gt;app.overlay&lt;/code&gt; (&lt;a href="https://docs.zephyrproject.org/latest/build/dts/howtos.html" rel="noopener noreferrer"&gt;Devicetree Overlay&lt;/a&gt;). Want to tweak system configuration? Edit &lt;code&gt;prj.conf&lt;/code&gt; (&lt;a href="https://docs.zephyrproject.org/latest/build/kconfig/index.html" rel="noopener noreferrer"&gt;Kconfig&lt;/a&gt;) and override without touching the original source.&lt;/p&gt;

&lt;p&gt;But production firmware development never follows the textbook. Sometimes you have to reach deep into the HAL (Hardware Abstraction Layer) to dodge a chipset errata. Sometimes the only way forward is a monkey patch inside Nordic's &lt;code&gt;nrfxlib&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In a global install, that single hack &lt;strong&gt;silently breaks every other NCS project on your machine&lt;/strong&gt;. On top of that, every new SDK release meant downloading gigabytes onto the main drive all over again.&lt;/p&gt;

&lt;p&gt;Under the banner of convenience, I had lost control of my build environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Dive: True Isolation and the Essence of Freestanding
&lt;/h2&gt;

&lt;p&gt;Breaking this cycle required flipping the NCS paradigm. Instead of "the Toolchain Manager owns my SDK," the goal was &lt;strong&gt;"my code picks and isolates its own SDK."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first concept I ran into was the &lt;strong&gt;&lt;a href="https://docs.zephyrproject.org/latest/develop/application/index.html" rel="noopener noreferrer"&gt;Freestanding Application&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the Zephyr ecosystem, a &lt;strong&gt;manifest (&lt;code&gt;west.yml&lt;/code&gt;)&lt;/strong&gt; is a dependency recipe file that declares "this project needs Zephyr version X, nRF module version Y." Think of it as the equivalent of Node.js's &lt;code&gt;package.json&lt;/code&gt; or Python's &lt;code&gt;requirements.txt&lt;/code&gt;. One &lt;code&gt;west update&lt;/code&gt; command pulls every source at the exact pinned version.&lt;/p&gt;

&lt;p&gt;Freestanding skips the manifest and any complex topology. It is the most intuitive way to isolate an SDK on a local dev machine. My app lives at &lt;code&gt;D:\Workspace\my_app\&lt;/code&gt;, the SDK sits in a completely separate location, and I bind &lt;code&gt;ZEPHYR_BASE&lt;/code&gt; via an environment variable only in the terminal session where I build.&lt;/p&gt;

&lt;p&gt;This is like mounting just the volumes you need into a Docker container — clean and minimal. It is the lightest starting point for physically separating your project from the SDK.&lt;/p&gt;

&lt;p&gt;But as projects grow, Freestanding hits its limits. That is where Zephyr's official &lt;a href="https://docs.zephyrproject.org/latest/develop/west/workspaces.html" rel="noopener noreferrer"&gt;West Workspace Topology&lt;/a&gt; comes in. Zephyr defines three topologies based on who owns the manifest repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;T1 (Zephyr-centric Star):&lt;/strong&gt; Zephyr itself is the manifest. This is what you get from a default &lt;code&gt;west init&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;T2 (App-centric Star):&lt;/strong&gt; Your app is the manifest. The cleanest layout for a single product.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;T3 (Forest):&lt;/strong&gt; A dedicated manifest repo manages multiple apps and the SDK as siblings. Built for multi-product teams.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post walks through the progression from Freestanding to T2, then T3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps0q2ek3vg8iktjltncl.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps0q2ek3vg8iktjltncl.webp" alt="NCS project isolation progression — from global install to Freestanding, T2 Star, and T3 Forest topology" width="800" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step-by-step progression from global install to Freestanding, T2, and T3&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: Building a Controlled Environment from Scratch
&lt;/h2&gt;

&lt;p&gt;Here is the step-by-step journey from the simplest Freestanding setup, through T2, to T3 Topology.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Pure Freestanding — The Lightest Isolation
&lt;/h3&gt;

&lt;p&gt;First, I stepped out of the Toolchain Manager's shadow and installed the toolchain and &lt;a href="https://docs.zephyrproject.org/latest/develop/west/index.html" rel="noopener noreferrer"&gt;West (Zephyr's meta-tool)&lt;/a&gt; directly inside a Python virtual environment (venv).&lt;/p&gt;

&lt;p&gt;Open a terminal at the app directory and manually bind the SDK dependency. On Windows, a single script call does it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;:: Temporarily bind the system NCS (Zephyr) directory to this session only
&amp;gt; C:\ncs\v2.x.x\zephyr\zephyr-env.cmd

:: Now the build command targets the SDK on the C drive
&amp;gt; west build -b nrf52840dk_nrf52840
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For a quick library test or a lightweight side project, this is enough. The binding lives only inside that terminal session and leaves everything else untouched.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: T2 Topology — Let the App Own Its Dependencies
&lt;/h3&gt;

&lt;p&gt;Freestanding relies on your memory or documentation to track the SDK version. Once you start building a real product, that weakness shows. A teammate should be able to &lt;code&gt;git clone&lt;/code&gt; and reproduce the exact build environment — telling them over Slack which &lt;code&gt;ZEPHYR_BASE&lt;/code&gt; to set is an accident waiting to happen.&lt;/p&gt;

&lt;p&gt;T2 is what Zephyr's official docs call the &lt;strong&gt;Star topology (application is the manifest repository)&lt;/strong&gt;. You place a &lt;code&gt;west.yml&lt;/code&gt; manifest directly inside your app repo, making the app itself the owner of its SDK dependencies.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_product/              # App repo IS the manifest repo (T2)
├── .west/               # Auto-generated by west init
├── app/                 # Application source
│   ├── CMakeLists.txt
│   ├── prj.conf
│   └── src/main.c
├── west.yml             # ★ Pins Zephyr, NRF, and module versions
├── zephyr/              # Pulled by west update
└── modules/             # nrf, hal_nordic, nrfxlib, etc.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pin a specific Zephyr tag or commit hash in &lt;code&gt;west.yml&lt;/code&gt;, and anyone who runs &lt;code&gt;west init -l . &amp;amp;&amp;amp; west update&lt;/code&gt; anywhere gets the exact same SDK version. The reproducibility that Freestanding lacked is now baked in.&lt;/p&gt;

&lt;p&gt;Zephyr's &lt;a href="https://docs.zephyrproject.org/latest/develop/west/manifest.html" rel="noopener noreferrer"&gt;Manifest Imports&lt;/a&gt; feature simplifies &lt;code&gt;west.yml&lt;/code&gt; authoring considerably. Instead of listing dozens of module versions by hand, you import the Zephyr manifest wholesale and only override what you need.&lt;/p&gt;
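&lt;p&gt;A minimal sketch of what that can look like for an NCS-based T2 app — the revision tag here is a placeholder, and you should pin whatever release you actually ship:&lt;/p&gt;

```yaml
# Hypothetical west.yml for a T2 app repo (app owns the manifest).
manifest:
  remotes:
    - name: ncs
      url-base: https://github.com/nrfconnect
  projects:
    - name: nrf
      remote: ncs
      repo-path: sdk-nrf
      revision: v2.6.1      # placeholder: pin your actual NCS release
      # Import Zephyr and the Nordic modules at the versions this NCS
      # release was tested with, instead of listing dozens by hand.
      import: true
  self:
    path: app
```

&lt;p&gt;The &lt;code&gt;import: true&lt;/code&gt; line is doing the heavy lifting: one pinned &lt;code&gt;sdk-nrf&lt;/code&gt; revision transitively pins the whole dependency tree.&lt;/p&gt;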

&lt;p&gt;The caveat: T2 assumes &lt;strong&gt;one app = one manifest&lt;/strong&gt;. For a single product it is hard to beat, but the moment you need to develop Product A and Product B on the same hardware platform, the model starts cracking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: T3 Topology — Forest Structure for Multiple Apps
&lt;/h3&gt;

&lt;p&gt;The moment arrived when I had to develop "Terminal A" and "Terminal B" on the same platform hardware simultaneously. Creating separate T2 workspaces for each meant duplicating multi-gigabyte SDK folders per product.&lt;/p&gt;

&lt;p&gt;The answer was &lt;strong&gt;T3 Topology&lt;/strong&gt;. Zephyr's docs call it the &lt;strong&gt;Forest topology&lt;/strong&gt; — a dedicated manifest repository arranges multiple apps and the SDK as siblings at the same directory level.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_workspace/            # Workspace root (T3 anchor)
├── .west/               # West metadata
├── manifest_repo/       # Single repo holding west.yml (dependency hub)
├── app_product_a/       # Application 1
├── app_product_b/       # Application 2
├── zephyr/              # Zephyr core pulled by West
└── modules/             # nrf, hal, nrfxlib, etc.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The decisive difference from T2: &lt;strong&gt;the manifest lives outside any app&lt;/strong&gt;. A single &lt;code&gt;manifest_repo/west.yml&lt;/code&gt; governs the Zephyr version, module versions, and even the Git revisions of every app. &lt;code&gt;app_product_a&lt;/code&gt; and &lt;code&gt;app_product_b&lt;/code&gt; remain fully decoupled, yet at build time they safely share the same verified SDK within &lt;code&gt;my_workspace&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;T3 also pays off when setting up CI/CD pipelines (e.g., GitHub Actions). The CI server clones &lt;code&gt;manifest_repo&lt;/code&gt;, runs &lt;code&gt;west update&lt;/code&gt;, and every app plus its dependencies land at the correct version in one shot.&lt;/p&gt;
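&lt;p&gt;A rough sketch of that CI job — repo layout, board, and app names are placeholders, and a real pipeline also needs the Zephyr SDK/toolchain installed, which I omit here for brevity:&lt;/p&gt;

```yaml
# Hypothetical GitHub Actions job for a T3 workspace (toolchain setup omitted)
name: firmware-ci
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4        # clones manifest_repo
        with:
          path: manifest_repo
      - run: pip install west
      - run: west init -l manifest_repo  # register the local manifest repo
      - run: west update                 # pull apps + SDK at pinned revisions
      - run: west build -b nrf52840dk/nrf52840 app_product_a
```

&lt;p&gt;Because the manifest pins everything, the only repository CI needs to know about is &lt;code&gt;manifest_repo&lt;/code&gt; — &lt;code&gt;west update&lt;/code&gt; reconstructs the rest of the forest.&lt;/p&gt;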

&lt;h3&gt;
  
  
  Troubleshooting: The Curse of Windows Long Paths (260-char Limit)
&lt;/h3&gt;

&lt;p&gt;The moment I moved to a West Workspace structure (T2 or T3) and wired up CI, builds started failing:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;"No such file or directory"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;NCS and Zephyr have deeply nested directory hierarchies. When CMake and Ninja generate build artifacts on top of that, the path easily blows past the Windows MAX_PATH limit of 260 characters. Unless you are using third-party tools that handle long paths natively, you will hit this.&lt;/p&gt;

&lt;p&gt;If you are building this structure on Windows, open an admin PowerShell and run this before anything else:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Disable the Windows 10/11 long path restriction (reboot recommended after)
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 -PropertyType DWORD -Force
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I missed this one setting and spent half a day chasing phantom CMakeLists.txt errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qi24vn9tz8hgvuux93k.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qi24vn9tz8hgvuux93k.webp" alt="Freestanding vs T2 Star vs T3 Forest topology comparison — manifest ownership, app count, and SDK sharing differences" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key differences between the three topology options at a glance&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway: Which Structure Should You Pick?
&lt;/h2&gt;

&lt;p&gt;After ditching the Toolchain Manager's global install, we gained three weapons for NCS project management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There is no silver bullet.&lt;/strong&gt; Each structure carries clear trade-offs, and the right choice depends on your project's expected lifetime and team size.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Freestanding&lt;/th&gt;
&lt;th&gt;T2 (Star)&lt;/th&gt;
&lt;th&gt;T3 (Forest)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manifest&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;App = Manifest&lt;/td&gt;
&lt;td&gt;Separate manifest repo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;App count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Multiple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SDK version pinning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual (&lt;code&gt;ZEPHYR_BASE&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Pinned via &lt;code&gt;west.yml&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Pinned via &lt;code&gt;west.yml&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Initial setup cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Near zero&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;One SDK copy per app&lt;/td&gt;
&lt;td&gt;One shared SDK copy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  1. Freestanding — When You Need Isolation Right Now
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Best for:&lt;/strong&gt; One-off side projects, quick sensor driver tests you plan to throw away, getting a build environment running in under 5 minutes on a personal machine.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Downside:&lt;/strong&gt; &lt;code&gt;west&lt;/code&gt; does not manage versions for you. You have to remember or document which SDK version you depended on, and manually bind the environment variable every time. Three months later — or when handing the project to a teammate — reproduction may fail.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. T2 (Star) — When You Are Building a Real Product
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Best for:&lt;/strong&gt; Serious single-product development where anyone should be able to &lt;code&gt;git clone&lt;/code&gt; and reproduce the exact build environment. A solo developer or small team focused on one firmware project.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Downside:&lt;/strong&gt; The entire SDK lives inside the app workspace, so adding a second product means duplicating the same Zephyr/NRF sources. Storage and &lt;code&gt;west update&lt;/code&gt; time scale linearly with product count.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. T3 (Forest) — When the Team Manages Multiple Products
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Best for:&lt;/strong&gt; Company-scale development with multiple products (A, B, C...) on the same hardware platform sharing common core logic. CI/CD pipeline integration is a must at this stage.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Downside:&lt;/strong&gt; Significant learning curve for the initial &lt;code&gt;manifest_repo/west.yml&lt;/code&gt; setup and directory structure conventions. A manifest maintainer must mediate version conflicts across products.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My own path: I started with Freestanding for rapid prototyping, moved to T2 once the product was greenlit, and switched to T3 when derivative products appeared on the same platform. You do not need to start at T3. Binding the SDK with a single &lt;code&gt;zephyr-env.cmd&lt;/code&gt; call via Freestanding is enough. That alone is the first step toward reclaiming control in the closed NCS ecosystem.&lt;/p&gt;

</description>
      <category>firmware</category>
      <category>ncs</category>
      <category>nrf</category>
      <category>zephyr</category>
    </item>
    <item>
      <title>Zephyr CMake Hell is Dead: Why I Let Google Antigravity Write My Firmware</title>
      <dc:creator>Errata Hunter</dc:creator>
      <pubDate>Wed, 11 Mar 2026 21:41:32 +0000</pubDate>
      <link>https://dev.to/erratahunter/zephyr-cmake-hell-is-dead-why-i-let-google-antigravity-write-my-firmware-2j73</link>
      <guid>https://dev.to/erratahunter/zephyr-cmake-hell-is-dead-why-i-let-google-antigravity-write-my-firmware-2j73</guid>
      <description>&lt;p&gt;I just burned another weekend chasing a ghost I2C timeout that wasn't even in the datasheet. Decided it was time to ditch the proprietary vendor lock-in and migrate to the modern Zephyr RTOS. My reward? Welcome to &lt;code&gt;west&lt;/code&gt; and &lt;code&gt;CMakeLists.txt&lt;/code&gt; hell.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52c93rg0yg4yb5f3kxbd.WEBP" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52c93rg0yg4yb5f3kxbd.WEBP" alt="Technical diagram of the Zephyr RTOS build pipeline illustrating how CMake processes Kconfig and Devicetree inputs to generate headers like autoconf.h and devicetree_generated.h before GCC compilation into zephyr.elf." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A high-level representation of Zephyr's build pipeline complexity. Before the HN pedants start arguing about exact CMake module dependencies in the comments—yes, this is simplified. The point is that you shouldn't need a PhD in build systems just to toggle a GPIO.&lt;/p&gt;

&lt;p&gt;Bare-metal firmware debugging is painful enough on its own. We shouldn't have to bleed over build system setups too. Web devs have AI agents scaffolding their entire async microservice architectures while we’re still grepping through 2010-era PDFs just to figure out a device tree path.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why I Brought Antigravity into the Embedded World&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I got sick of it. I decided to throw Google's new agent-centric IDE, Antigravity, at my firmware problems.&lt;/p&gt;

&lt;p&gt;The software world is moving at warp speed with AI, while the embedded industry remains a walled garden of bloated IDEs and slow iteration. We need to adapt. By offloading the massive barrier to entry—Kconfig labyrinths, linker scripts, and toolchain paths—to an AI agent, we can actually focus on what matters: the core logic and hardware interactions. I’m documenting this because we need to lower the barrier to entry in bare-metal engineering. Stop fighting the build system. Get your ideas out there.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Arming Antigravity: Mandatory &amp;amp; Recommended Extensions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It’s not some bloated enterprise tool. Setup is dead simple.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Download the OS-specific build from &lt;a href="https://antigravity.google/download" rel="noopener noreferrer"&gt;antigravity.google&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt; It’s a VS Code fork under the hood, so your muscle memory for shortcuts and UI remains intact.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But since we are forcing an AI agent to write firmware, we have to arm it with traditional weapons for hardware debugging. Go to the Extensions tab and install these:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Mandatory Extensions (The Core)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The nRF Extensions Quirk:&lt;/strong&gt; In standard VS Code, you'd just lazy-install the &lt;code&gt;nRF Connect for VS Code Extension Pack&lt;/code&gt; and call it a day. However, Antigravity’s extension search currently doesn't index the consolidated pack. Don't panic. You just have to manually search and install its four horsemen individually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;nRF Connect for VS Code:&lt;/strong&gt; The absolute backbone for SDK, toolchain management, building, and debugging.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;nRF DeviceTree:&lt;/strong&gt; Visualizes and edits the nightmare that is device tree structures.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;nRF Kconfig:&lt;/strong&gt; GUI editor for project settings so you don't go blind reading Kconfig files.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;nRF Terminal:&lt;/strong&gt; Serial and RTT logging terminal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Recommended Auxiliary Extensions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To actually read the code the agent spits out and survive debugging, you should install these for development efficiency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Cortex-Debug (marus25.cortex-debug):&lt;/strong&gt; Provides ARM Cortex-M debugging capabilities. &lt;strong&gt;Do not skip this.&lt;/strong&gt; You can't debug bare-metal if you can't dump your hardware registers.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;C/C++ (Microsoft):&lt;/strong&gt; Essential for code compilation, IntelliSense (autocomplete), and debugging support.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;CMake Tools (Microsoft):&lt;/strong&gt; Manages the CMake build system, which is the beating heart of Zephyr.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Antigravity Workflow: Navigating the nRF Connect Sidebar&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once you have the extensions installed, you'll see the Nordic icon pop up on your left activity bar. Open it up, and you get three beautiful panels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Welcome:&lt;/strong&gt; This is your entry point. You use this panel to manage your SDK versions, set up toolchains, and create new projects from templates.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Application:&lt;/strong&gt; This shows the structure of your loaded Zephyr projects. It’s where you manage not just your source code, but also keep an eye on your device tree overlays and Kconfig (&lt;code&gt;prj.conf&lt;/code&gt;) files.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Build:&lt;/strong&gt; The holy grail. You don't have to constantly wrestle with &lt;code&gt;west build&lt;/code&gt; commands in the terminal anymore. Once your board target is set, you just hit the &lt;strong&gt;"Build"&lt;/strong&gt; and &lt;strong&gt;"Flash"&lt;/strong&gt; buttons right here. Need to open the Kconfig GUI or start a Debug session? One click. It’s a completely frictionless workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What’s Next: NCS and Project Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The toolchain is ready, and the workflow is set. But before we unleash the AI to actually write our firmware, there’s a multi-gigabyte elephant in the room: installing the &lt;strong&gt;nRF Connect SDK (NCS)&lt;/strong&gt; and structuring your repository.&lt;/p&gt;

&lt;p&gt;If you dump your code inside the vendor SDK folder ("In-tree") like beginner tutorials suggest, your Git history will become a dumpster fire.&lt;/p&gt;

&lt;p&gt;In Part 2, I’ll walk you through taming the massive NCS installation without breaking your python paths. More importantly, we’ll deep-dive into the architectural holy war of Zephyr project management: &lt;strong&gt;Freestanding Applications vs. T2 Topology (Workspace)&lt;/strong&gt;. I’ll break down their pros, cons, exact use cases, and how to set them up so you don't hate yourself later. Stay tuned.&lt;/p&gt;

</description>
      <category>antigravity</category>
      <category>nrf52840</category>
      <category>zephyrrtos</category>
      <category>firmware</category>
    </item>
  </channel>
</rss>
