
Errata Hunter

Posted on • Originally published at reversetobuild.com

How I Build Firmware with AI — A Research, Plan, Execute, Test Loop in Practice

TL;DR

  • Tell an AI "implement this" in firmware and you get nonexistent register addresses and ISR-incompatible APIs that pass the build but brick the board.
  • A 4-stage loop — research, plan, execute, test — with two human gates (datasheet cross-check, design review) stops bad information from propagating into code.
  • AI output needs human verification during research and planning, but for log analysis AI is faster than any human — calibrate AI involvement per stage.

I expected AI coding tools to boost my productivity. Half right, half wrong. I develop Zephyr RTOS-based firmware for nRF52/nRF53 using Claude Code as my primary tool, and the first few weeks actually made things worse. As I covered in a previous post, the AI confidently recommended Kconfig symbols that don't exist, generated register settings off by a single bit from the datasheet, and wrote code calling APIs that must never run in interrupt context.

The problem wasn't the AI's capability — it was how I used it. Copy-pasting AI output without verification might work for web frontends, but in firmware it's the fastest way to brick a board. After a month of trial and error, I settled on a research → plan → execute → test loop. I don't start firmware work without it now.

This is a field report on what I delegate to AI at each stage, where I intervene personally, and which pitfalls are specific to the firmware domain.


Why "Move Fast and Fix Things" Doesn't Work in Firmware

In web development, you edit code and hot reload gives you instant feedback. Something breaks, the browser console tells you, you fix it. The feedback loop runs in seconds.

Firmware is different. A wrong clock configuration can render the MCU unresponsive. Misconfigure a single GPIO pin and overcurrent can physically damage external circuitry. Miss a watchdog timer setup and the device enters an infinite reset loop — tracking down the cause means connecting a J-Link debugger and stepping through the boot sequence line by line. The cost of "just try it and fix later" is in a different league from the web.

Three reasons AI is particularly dangerous in this domain:

First, hallucinations pass compilation. When an LLM generates a nonexistent register address or incorrect bit mask, the C compiler treats it as a constant. The build succeeds. The problem only surfaces when you flash the board. In web development, calling a nonexistent API triggers an immediate runtime error. In firmware, "silent failures" are far more common.

Second, register maps differ between variants in the same chip family. nRF52832 and nRF52840 are both nRF52 series, but their peripheral configurations differ. When AI sees nRF52832 code in its training data and applies it directly to an nRF52840 target, the build passes but the hardware doesn't work.

Third, code generation without domain context can produce fundamentally wrong patterns. I've seen AI write a UART receive handler using dynamic memory allocation and callback chains. Reasonable in Linux userspace, but putting malloc in a UART handler running in ISR context on an MCU with 256KB of RAM leads to crashes at unpredictable times. A static ring buffer is the right answer, but AI proposes the pattern it's seen most.
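To make the contrast concrete, here's a minimal sketch of the static ring buffer pattern in plain C. Names and the buffer size are illustrative, not from any particular SDK; a production version would also count dropped bytes and use the platform's memory barriers where needed.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Illustrative sketch: fixed-size ring buffer for UART RX.
 * No malloc, no callback chains — safe to fill from ISR context.
 * Size is a power of two so wrap-around is a cheap bit mask. */
#define UART_RX_BUF_SIZE 256u

static volatile uint8_t rx_buf[UART_RX_BUF_SIZE];
static volatile size_t rx_head; /* written only by the ISR */
static volatile size_t rx_tail; /* written only by thread context */

/* Called from the UART RX interrupt: O(1), no allocation. */
static bool uart_rx_isr_put(uint8_t byte)
{
    size_t next = (rx_head + 1u) & (UART_RX_BUF_SIZE - 1u);
    if (next == rx_tail) {
        return false; /* full — drop the byte (count overruns in real code) */
    }
    rx_buf[rx_head] = byte;
    rx_head = next;
    return true;
}

/* Called from thread context to drain received bytes. */
static bool uart_rx_get(uint8_t *out)
{
    if (rx_tail == rx_head) {
        return false; /* empty */
    }
    *out = rx_buf[rx_tail];
    rx_tail = (rx_tail + 1u) & (UART_RX_BUF_SIZE - 1u);
    return true;
}
```

The single-producer/single-consumer split (ISR writes the head, thread writes the tail) is what makes this lock-free on a single-core MCU; no mutex, and therefore nothing the ISR could block on.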

Using AI in this environment requires a structure: provide sufficient context before generation, and verify the output after. That's the starting point of the 4-stage loop I've built.


The 4-Stage Loop

*Four-stage firmware development loop with two human verification gates, between Research→Plan and Plan→Execute. AI generates artifacts at each stage; humans verify at the two gates.*

The critical elements are the two human gates.

Gate 1 (between Research and Plan): I cross-check the AI's research output against original documentation. If a hallucination slips through here, the bad information propagates into the plan and then into the code.

Gate 2 (between Plan and Execute): I review the code snippets and constraints in the AI-generated plan. Wrong init priority ordering or blocking API calls inside ISR handlers must be caught at this gate. Design errors that pass this point manifest as "build succeeds, flash succeeds, but crashes under specific conditions" — the worst debugging scenario.

Compared to the web's "code → hot reload → verify" loop, this has more steps. But in firmware, the time cost of build → flash → hardware verification is so long that "discovering bad code late" is far more expensive than "planning carefully up front." The 4-stage loop reflects that cost structure.

Every stage produces a .md file. Not a chat response that vanishes when the session ends, but a document that persists in the file system. If the session disconnects or the context window resets, I reload the previous stage's artifact and pick up where I left off. This "persistent document chain" is the infrastructure that holds the loop together.


Research — What Happens When You Feed 200 Pages of Datasheet to AI

This stage has the best time-to-value ratio of all four. When I need to work with a new peripheral, I ask AI for a structured summary instead of reading the datasheet cover to cover.

A common mistake here: feeding the entire datasheet PDF to the AI.

MCU datasheets typically run 200–800 pages. The nRF5340 Product Specification alone is hundreds of pages. Dumping all of it into context burns a significant number of input tokens. The bigger problem: with hundreds of pages loaded at once, the AI loses focus on the relevant section and starts pulling patterns from unrelated information.

My approach: feed it section by section.

If I need to implement an I2C driver, I extract just the I2C (TWI/TWIM) chapter from the datasheet. "Read only Section 6.13 TWIM from this PDF and organize the following items." I add the register map table and timing diagram pages if needed. This reduces token cost while narrowing the AI's focus, improving accuracy.

Principles for AI Research Tasks

Explicitly ask for deep analysis. Skip this and the AI returns a surface-level paraphrase of the first paragraph. The prompt structure I actually use looks like this:

```
Deep-dive into this MCU's TWIM (I2C Master) peripheral on the following points:
1. Init sequence — full order from clock enable to first transaction
2. Per-register bit field meanings — especially FREQUENCY, ADDRESS, ERRORSRC
3. Whether DMA setup is required or manual byte transfer is possible
4. Clock stretching support and timeout configuration
5. Any discrepancies between the official SDK nrfx_twim driver
   (https://github.com/NordicSemiconductor/nrfx) and the datasheet
6. Trade-offs of each approach (DMA vs interrupt-driven, polling vs event-driven)
```

Ask about trade-offs from the research stage. DMA frees the CPU but consumes a DMA channel and adds configuration complexity. Interrupt-driven is simpler to implement but increases CPU load at high communication speeds. Gathering this decision material during research speeds up decision-making during the planning stage.

Save results to a .md file. I instruct: "Save the research results to research.md. Include code snippets (register setup examples, SDK API call patterns) for each item." Chat responses disappear when the session ends. A .md file can be reloaded as context for the planning stage, and it's easy to cross-check against the original datasheet side by side.

Gate 1: Human Cross-Verification

The most important action at this stage: comparing the AI's summary against the original datasheet.

If the AI reports "the TWIM FREQUENCY register value 0x06400000 corresponds to 400kHz," I verify that value directly in the datasheet's register map table. In my experience, AI gets register addresses and bit field values wrong roughly 10–15% of the time. Most errors come from mixing data between similar chip variants. Skip this gate, and incorrect register values propagate through the plan into actual code, manifesting as I2C communication failures on the board. Tracking that down might require an oscilloscope.
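One way to make a Gate 1 result durable is to encode it in the build. A hypothetical sketch: once I've verified a value by hand against the register map, I pin it with a static assertion so a later edit (human or AI) can't silently drift from it. The `0x06400000` value is the one quoted above; the macro names are illustrative, not from the Nordic SDK headers.

```c
#include <assert.h>

/* Hypothetical pattern: encode a Gate 1 cross-check in the build.
 * Macro names are illustrative, not real SDK identifiers. */

/* Value verified by hand against the datasheet's TWIM FREQUENCY table. */
#define DATASHEET_TWIM_FREQ_400K 0x06400000u

/* Value the driver code (possibly AI-generated) actually programs. */
#define TWIM_FREQ_400K 0x06400000u

/* If an edit ever drifts from the verified value, the build fails —
 * instead of I2C silently failing on the board. */
static_assert(TWIM_FREQ_400K == DATASHEET_TWIM_FREQ_400K,
              "TWIM 400 kHz value no longer matches datasheet-verified value");
```

It's a small thing, but it converts a one-time manual verification into a permanent compile-time check.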

The research review takes me 15–30 minutes. Reading the datasheet from scratch without AI would take 2–3 hours. AI summary + cross-check in under 30 minutes. That time saving is the primary reason I use AI for research.

*Datasheet feeding strategy comparison: feeding the full 700-page PDF raises token cost and lowers accuracy. Section-by-section is the way.*


Planning — Agree on the Design Before Writing Code

After research, the urge to start coding is strong. Resisting that urge is the second key to this workflow.

In the planning stage, I ask the AI: "Using the research document as reference, plan which files to modify, in what order, using which APIs." I always specify a few things explicitly.

Use Checklist Format

```markdown
## Implementation Plan: TWIM I2C Driver

### Constraints
- Only `k_sem_give()` allowed in TWIM ISR, `k_malloc()` forbidden
- init priority: TWIM at POST_KERNEL level, after device default priority (40)
- I2C bus shared by 2 sensors → mutex required

### Implementation Items
- [ ] 1. Enable TWIM node in Devicetree overlay
- [ ] 2. Add `CONFIG_I2C`, `CONFIG_NRFX_TWIM0` to Kconfig
- [ ] 3. Write i2c_wrapper.h — define init, read, write APIs
- [ ] 4. Implement i2c_wrapper.c — nrfx_twim based, mutex-protected
- [ ] 5. Switch sensor A driver to use i2c_wrapper calls
- [ ] 6. Build verification and basic I2C scan test
```

Checkboxes (`- [ ]`) serve a purpose. During execution, I tell the AI "implement item 1 and mark the checkbox as `[x]`." When a session breaks or I resume the next day, opening this .md file immediately shows what's done and where things stalled.

Include Code Snippets

"Add TWIM node to the Devicetree overlay" alone is unreviewable. I have the AI write actual code snippets at the planning stage:

```dts
/* Before: no TWIM node in app.overlay */

/* After */
&i2c0 {
    compatible = "nordic,nrf-twim";
    status = "okay";
    pinctrl-0 = <&i2c0_default>;
    pinctrl-1 = <&i2c0_sleep>;
    pinctrl-names = "default", "sleep";
    clock-frequency = <I2C_BITRATE_FAST>;  /* 400 kHz */
};
```

This lets me check specifics during review: "Does the pinctrl name match the actual board DTS?", "Is I2C_BITRATE_FAST supported on this chip?" You can't review an abstract plan. You can review a code snippet.

Record Trade-offs

```markdown
### Trade-off Analysis
| Option | Pros | Cons |
|--------|------|------|
| nrfx_twim (HAL) | Direct control, minimal overhead | No Zephyr DTS integration |
| Zephyr i2c API | DTS auto-binding, portable | Abstraction layer overhead |

→ **Choice: Zephyr i2c API** — sensor drivers already use Zephyr APIs, so compatibility wins.
```

This record pays off when future-me asks "why did I do it this way?" A few weeks later, when a performance issue prompts considering a switch to nrfx_twim, the decision context is right there in the .md file.

Gate 2: Human Design Review

Three points I focus on during plan review:

  1. Init priority ordering: Wrong driver init order in Zephyr causes null pointer dereferences at boot. AI frequently overlooks this.
  2. ISR context constraints: AI often fails to distinguish APIs callable from interrupt handlers vs. thread context. k_mutex_lock() cannot be used in ISR — catch it here.
  3. Shared resources: Missing mutex protection on a shared I2C bus, incorrect SPI CS pin management.

Design errors that pass this gate produce "build success, flash success, but crash under specific conditions" — the worst scenario. Timing-dependent bugs are hard to reproduce, and tracking one down can eat half a day. Thirty minutes of careful plan review saves four hours of debugging.
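The ISR-context rule generalizes to one pattern worth checking for in every plan: the ISR only records that work exists; anything that can block runs later in thread context. A plain-C sketch of that shape — in real Zephyr firmware the signal would be `k_sem_give()` from the ISR and a thread or work item doing the mutex-protected work; the C11 atomic here just stands in for the signaling primitive so the sketch runs anywhere:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the defer-to-thread pattern. In Zephyr, the flag below would
 * be a semaphore (k_sem_give is ISR-safe) and the poll function a thread
 * loop or work handler; a C11 atomic stands in for portability. */
static atomic_bool i2c_event_pending = false;
static int events_handled;

/* "ISR": never blocks, never allocates — it only signals. */
static void twim_isr(void)
{
    atomic_store(&i2c_event_pending, true);
}

/* Thread context: free to take mutexes, log, talk to other drivers. */
static bool i2c_thread_poll(void)
{
    bool expected = true;
    if (atomic_compare_exchange_strong(&i2c_event_pending, &expected, false)) {
        events_handled++;   /* blocking work (k_mutex_lock, etc.) goes here */
        return true;
    }
    return false;
}
```

If a plan puts anything other than the top function's kind of work inside an ISR handler, that's a Gate 2 rejection.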


Execution — One Item at a Time, Check After Each

Once the plan is approved, I have the AI write code. The principle is simple: execute one item, verify the build, mark it complete, then move to the next.

```text
Me: "Implement item 1 from plan.md. Mark the checkbox [x] when done."
AI: (modifies Devicetree overlay, marks checkbox)
Me: west build → success confirmed
Me: "Implement item 2."
...
```

Applying multiple changes at once in firmware makes build errors extremely hard to trace. When Kconfig and source code changes land simultaneously, just separating "is this a config problem or a code problem?" wastes time. One-item-at-a-time execution narrows error causes to exactly one change.

When Build Errors Occur

Zephyr/west build errors are notoriously unfriendly. CMake configuration errors, Kconfig dependency conflicts, Devicetree binding mismatches, and linker errors pour out in dozens of log lines. This is where AI excels.

Paste the full error log. Not "I got a build error" — copy the entire terminal output. AI extracts the actual error line from verbose CMake traces and pinpoints causes like "this error is a dependency conflict because CONFIG_I2C is enabled but CONFIG_GPIO is missing." I use AI to identify the error category; I decide the actual fix based on the plan's context.

Document Execution History

I record build errors, workarounds, and unexpected behavior in a .md file. Short entries like "item 3: CONFIG_NRFX_TWIM0 deprecated, used CONFIG_I2C_NRFX_TWIM instead."

This record pays off in two situations. First, when a similar project hits the same issue, I hand the past record to AI and it gets "we solved this before" context immediately. Second, when the context window resets after a long conversation, reloading the execution log .md restores the current state.


Testing — Paste Logs, AI Debugs

The reality of firmware testing: no matter how thorough the unit tests, on-board verification is the final check. All I2C driver unit tests can pass, but if clock stretching timeout hits during actual sensor communication, those unit tests mean nothing.

When problems occur, my most-used pattern: paste the entire log into AI.

I copy runtime logs collected via UART or RTT and ask "analyze the root cause from this log." Here's an example:

```text
[00:00:01.234] <inf> twim: TWIM init OK, freq=400kHz
[00:00:01.240] <inf> sensor_a: Starting I2C read, addr=0x48
[00:00:01.245] <wrn> twim: TWIM event: ERROR_SRC=0x02 (ANACK)
[00:00:01.245] <err> sensor_a: I2C read failed: -5 (EIO)
[00:00:01.250] <inf> sensor_a: Retry 1/3
[00:00:01.255] <wrn> twim: TWIM event: ERROR_SRC=0x02 (ANACK)
[00:00:01.260] <inf> sensor_a: Retry 2/3
[00:00:01.265] <wrn> twim: TWIM event: ERROR_SRC=0x02 (ANACK)
[00:00:01.270] <err> sensor_a: All retries exhausted
```

The AI immediately responds: "ERROR_SRC=0x02 is Address NACK. Verify sensor address 0x48. If correct, suspect missing pull-up resistors or wiring issues." A human reading this log reaches the same conclusion, but looking up whether bit 1 of the ERROR_SRC register is ANACK in the datasheet takes 5 minutes. AI does it in 1 second.
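The decode step itself is mechanical, which is exactly why AI is fast at it. For reference, a tiny C helper for the bit layout the log above implies — bit assignments as documented for nRF52-series TWIM ERRORSRC, but verify them against your own chip's datasheet before trusting this:

```c
#include <stdint.h>

/* TWIM ERRORSRC bit layout as on nRF52-series TWIM (verify per chip):
 * bit 0 OVERRUN, bit 1 ANACK (address NACK), bit 2 DNACK (data NACK). */
#define TWIM_ERRORSRC_OVERRUN (1u << 0)
#define TWIM_ERRORSRC_ANACK   (1u << 1)
#define TWIM_ERRORSRC_DNACK   (1u << 2)

/* Map the raw register value to a human-readable diagnosis. */
static const char *twim_errorsrc_str(uint32_t errorsrc)
{
    if (errorsrc & TWIM_ERRORSRC_ANACK) {
        return "ANACK: no ACK on address - check addr, pull-ups, wiring";
    }
    if (errorsrc & TWIM_ERRORSRC_DNACK) {
        return "DNACK: no ACK on data byte";
    }
    if (errorsrc & TWIM_ERRORSRC_OVERRUN) {
        return "OVERRUN: received byte lost";
    }
    return "no error";
}
```

Baking a decoder like this into the firmware's own log output removes the lookup step entirely, for both humans and AI.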

RTT (Real-Time Transfer) logs pair even better with AI than UART. RTT writes directly to a ring buffer in RAM without using any MCU peripheral, so CPU overhead is nearly zero — you can log even in timing-critical sections. Feed AI the ISR timing logs, DMA completion callback ordering, and thread context switch timestamps, and it finds patterns a human would struggle to spot in hundreds of lines: "Interrupts A and B fire in succession with only 8μs between them at this point."

This is why I consider the testing stage the highest-leverage point for AI in this workflow. During research and planning, AI output requires human verification. But in log analysis, AI is faster than a human, and the margin for error is smaller. Logs are facts, and AI extracts patterns from facts. There's less room for hallucination.

Limits exist, of course. AI can say "check the pull-up resistors," but picking up a multimeter and measuring resistance is a human job. Capturing SDA/SCL waveforms with a logic analyzer to confirm clock stretching is happening — also human. AI sets the debugging direction, but it cannot replace physical hardware verification.

*AI effectiveness spectrum across the 4-stage loop: lowest in code generation, highest in testing — where the input is factual log data.*


What Changed and What Didn't

I've used this workflow for over a month. Here's what shifted.

What changed:

My role moved from "person who writes code" to "person who makes decisions and verifies." Time spent typing code shrank. Time spent cross-checking AI output against datasheets and reviewing constraint sections of implementation plans grew.

Research time dropped by more than half. When working with a new peripheral, I ask AI for a structured summary and cross-check only the critical parts against the original — much faster than reading the datasheet from page one.

Debugging patterns changed too. I used to read error logs and mentally cycle through possible causes one by one. Now I paste logs into AI, ask for "top 3 probable causes ranked by likelihood," and start verifying from the most likely.

What didn't change:

Physical hardware testing remains beyond AI's reach. Verifying waveforms on an oscilloscope, measuring current draw, testing under various temperature conditions — still a human job.

I treat AI-generated code more conservatively for security-related work. Encryption key management, secure boot chains, OTA signature verification — a single mistake in these areas can compromise the entire product's security. I use AI for research only in this domain; code generation stays manual.

What I want to try next:

I'm considering connecting Hardware-in-the-Loop (HIL) testing to the CI pipeline. Attach physical boards to a CI server, automatically build → flash → run basic communication tests on AI-generated code. This would tighten the feedback loop after Gate 2. Still in the infrastructure setup phase, but once this loop is automated, AI utility in firmware development takes another step up.

AI doesn't replace firmware engineers. It helps firmware engineers make better decisions. But getting that help right requires structurally designing "where AI contributes and where humans intervene." The research → plan → execute → test loop is the current version of that design I've found. I plan to keep refining it.
