
Errata Hunter

Posted on • Originally published at reversetobuild.com

The Build Passed, So Why Doesn't It Run — Automating Firmware Tests on Real Hardware

TL;DR

  • A passing west build doesn't mean the firmware runs on hardware — catching silent failures manually doesn't scale.
  • Combining Zephyr Twister's --device-testing mode with a self-hosted runner gives you automated serial-log-based testing on real boards with every push.
  • All you need to start is a Raspberry Pi (or even a Windows PC) and a J-Link.

The most deflating moment in firmware development is when west build passes cleanly, you flash the board, and nothing happens. No serial output, no LED activity — just a Hard Fault dump, or worse, dead silence.

A successful build means "the code compiles and links." It does not mean "the firmware behaves as intended on hardware." Verifying that gap requires plugging in a J-Link, flashing, opening a serial terminal, and reading the logs by hand. Repeat this for every change, and eventually you start cutting corners: "I'll just spot-check this one." Those shortcuts compound, and regression bugs creep in.

This post documents how I automated that manual verification. On every push, the firmware is automatically flashed to a real board, serial logs are captured, and pass/fail is determined — a HIL (Hardware-in-the-Loop) CI (Continuous Integration) pipeline. I built it with Zephyr's Twister test framework, a self-hosted runner, and a single Raspberry Pi.

This is the fifth post in the "AI and Embedded Firmware" series. The previous post introduced a workflow for structurally preventing AI hallucinations, and this post closes the last gap: manual testing. Each post stands on its own — you don't need to read the series in order.


The Cost of "Just Flash It and See"

The weakest link in the four-stage loop was the final stage: testing. After the AI wrote the code, I reviewed it, and west build passed, the next step was entirely manual — plug in a J-Link (SEGGER's debug probe), run west flash, open a serial terminal, and read the logs.

Three problems with this manual routine:

First, a passing build doesn't guarantee correct behavior. AI-generated Zephyr code frequently passed west build and then silently failed on hardware. Calling k_sleep() inside a timer callback, attempting heap allocation in an ISR (Interrupt Service Routine) context — the compiler catches none of this. You only see the Hard Fault after flashing, or worse, the board just hangs with zero output. The Stack Overflow 2025 Developer Survey reported that 45% of developers spend more time debugging AI-generated code than writing it themselves.

Second, manual testing doesn't scale. Fixing one feature can break another. In web development, automated test suites catch these regressions. My firmware workflow had no such safety net. Testing every feature manually after every change is impractical, so I'd only verify "the part I just touched." That's exactly how regression bugs get in.

Third, repetition creates friction. Plug in J-Link, wait for flash, open serial monitor, scan for log patterns, record the result. Two to three minutes each time, but across dozens of iterations per day with the AI loop, the cumulative cost adds up. The real damage, though, is the temptation to skip it. And the code you skip testing on is the code that causes problems later.

The first three stages of the loop — research, planning, execution — were already efficient thanks to AI collaboration. But as long as the last stage was manual, it bottlenecked the entire pipeline. The missing piece was test automation. And in embedded, test automation means putting real hardware in the loop — HIL.

Wait — Can't I Just Test on the Board Locally?

"I already have a board on my desk and I'm running west flash — why bother with CI?" I thought the same thing at first.

Local testing and HIL CI perform the same physical actions (flash, check serial logs), but the implications differ:

| | Local Testing (My Desk) | HIL CI (Automated) |
| --- | --- | --- |
| When it runs | When I remember to do it | Automatically on every push |
| Scope | Tends to verify "just the part I changed" | Runs the entire defined test suite every time |
| Environment | Depends on my PC's current state | Pinned via Docker / fixed SDK version |
| Records | Stays in my head | Persisted in CI logs, visible to the team |
| Regression prevention | "Previous features are probably fine" | "Previous features still pass" — verified automatically |

It's the same difference as running npm test locally versus having GitHub Actions run it on every PR. Local testing is a snapshot of "what I verified right now." CI testing is a gate that "every change must pass before merge."

This difference matters especially for firmware because regressions take far longer to surface. A web service shows error rate spikes on a monitoring dashboard immediately after deploy. Firmware can silently malfunction — intermittent BLE disconnects, failing to wake from sleep under specific conditions. You may not find out until a customer reports it. CI verifying basic behavior on every commit means that at minimum, you can git bisect to find "when exactly did it break."


Embedded CI Is Not Web CI

Web backend CI is comparatively straightforward. Push code, a cloud VM spins up, installs dependencies, runs tests, reports results. The VM starts clean every time, so environment-related flaky tests are relatively rare.

Embedded CI is fundamentally different.

The Limits of QEMU

Zephyr has a built-in test runner called Twister, and Twister can run tests on QEMU (an open-source hardware emulator). Testing without a physical board, straight from a CI server — that's appealing. The Zephyr project itself runs thousands of QEMU-based Twister tests.

But QEMU's coverage has hard limits:

| Verifiable with QEMU | Not Verifiable with QEMU |
| --- | --- |
| Kernel scheduling, mutexes, semaphores | GPIO, SPI, I2C driver behavior |
| Memory allocation/deallocation logic | BLE stack (connection, pairing, data transfer) |
| Data structures, protocol parsing | DMA (Direct Memory Access) transfers |
| State machine transitions | Interrupt timing, priority inversion |
| Pure algorithm tests | Power management (sleep, wake) |

QEMU's driver model doesn't cover every edge case — certain behaviors are considered unnecessary in an emulated environment. The core functionality of most product firmware sits in the right column. The reality is: "Most firmware is too tightly coupled with hardware for emulation to be the only path forward — at some point, the dev board is the only way to make progress."

Renode is an alternative emulator with richer peripheral emulation. Memfault's Interrupt blog covered a test automation case combining GitHub Actions and Renode. But no matter how advanced emulators get, reproducing BLE RF paths or real sensor analog characteristics remains fundamentally difficult.

Variables That Only Physical Hardware Creates

Real-board testing introduces variables that don't exist in emulation:

  • Timing: Virtual time in an emulator and physical time on real hardware flow differently. A 100ms timeout can pass in QEMU and fail on the board.
  • Power: Unstable USB hub power can reset the board or interrupt flashing mid-process. The CI log just says "connection lost."
  • RF environment: BLE tests are affected by ambient Wi-Fi interference. The same code can pass at the office and fail in the server room.

These variables create flaky tests. In web CI, flaky tests are mostly async timing issues fixable by code changes. In embedded CI, flaky tests are often caused by the physical environment — no amount of code changes will eliminate them.

That's the reality. Embedded CI is not a world where "correct code guarantees passing tests." But it's still better than manual testing. "Imperfect but automated verification" is more reliable in practice than "thorough but human-dependent verification." I decided to build a HIL CI pipeline.


Pipeline Design — How Far to Automate

Self-hosted Runner: The Common Pattern for Connecting Physical Boards to CI

GitHub Actions, GitLab CI/CD, Bitbucket Pipelines — all default to cloud VM runners. You can't plug an nRF52 DK into a cloud VM via USB, so all three platforms support self-hosted runners: installing a CI agent on your own physical machine.

The architecture is the same regardless of platform:

HIL CI architecture — Git push triggers cloud workflow, self-hosted runner flashes firmware via USB to nRF52 DK

Self-hosted runner bridges the cloud CI platform and the physical board

I chose a Raspberry Pi 4 as the runner. The reason is simple: low power consumption for 24/7 operation, four USB ports for connecting multiple boards, and ARM Linux where the Zephyr toolchain runs natively. [TBD: Need to add actual Raspberry Pi performance/stability experience after use]

You Don't Need a Raspberry Pi

"Do I have to buy a Raspberry Pi?" No. A self-hosted runner is any machine that can run the CI agent software. A Linux desktop, a macOS laptop, even a Windows PC works.

Using a Windows PC as a runner:

GitHub Actions, GitLab CI, and Bitbucket Pipelines all officially support Windows runner agents. The GitHub Actions runner is best installed in a drive root folder like C:\actions-runner (to avoid Windows path length limits), and GitLab Runner provides an .exe installer.

Build and flash tools also run on Windows. west build, west flash, and nrfjprog all officially support Windows. Install nRF Command Line Tools, and nrfjprog is on your PATH. With J-Link drivers installed, you can flash to a USB-connected board immediately. Git for Windows includes Git Bash, so most shell commands in CI YAML run: blocks execute as-is.

The trade-offs:

| Factor | Raspberry Pi | Windows PC |
| --- | --- | --- |
| 24/7 operation | 5W power draw, no issue | Keeping a PC always on is impractical; sleep mode kills the runner |
| Docker support | Native Linux, works out of the box | Requires Docker Desktop or WSL2. nrf-docker is an amd64 Linux image, so the WSL2 backend is mandatory |
| USB stability | Dedicated device, minimal interference | Potential port contention with other USB devices |
| Upfront cost | ~$100 (Pi + board) | $0 if using an existing PC |

I prefer a dedicated runner machine, which is why I chose the Pi. But if you're just getting started, installing the runner on an existing Windows PC and plugging the board in via USB is the lowest-friction entry point. You can split it off to a Pi later once CI is stable.

Platform Comparison

Runner registration differs across the three platforms, but the end result — "run CI jobs on a local machine with access to connected hardware" — is identical.

GitHub Actions:

# .github/workflows/hil-test.yml
name: HIL Test
on: [push, pull_request]

jobs:
  flash-and-test:
    runs-on: self-hosted  # runs on self-hosted runner
    steps:
      - uses: actions/checkout@v4
      - name: Build firmware
        run: west build -b nrf52dk/nrf52832
      - name: Flash and test
        run: west twister --device-testing --hardware-map hardware-map.yml -T tests/

GitLab CI/CD:

# .gitlab-ci.yml
hil-test:
  tags:
    - nrf52dk  # only runs on runners tagged with this label
  script:
    - west build -b nrf52dk/nrf52832
    - west twister --device-testing --hardware-map hardware-map.yml -T tests/

Bitbucket Pipelines:

# bitbucket-pipelines.yml
pipelines:
  default:
    - step:
        name: HIL Test
        runs-on:
          - self.hosted
          - linux
          - nrf52dk  # custom label
        script:
          - west build -b nrf52dk/nrf52832
          - west twister --device-testing --hardware-map hardware-map.yml -T tests/

The key difference is runner selection syntax. GitHub uses runs-on: self-hosted, GitLab uses tags:, Bitbucket uses runs-on: with a label array. The build and test commands are identical.

I found GitLab's tag system most natural for embedded. Tag runners with nrf52dk, esp32, stm32f4, and tests automatically route to the matching hardware. I'd heard that one reason the embedded/semiconductor industry favors GitLab Self-managed instances is this flexible runner tag system — after trying it myself, I can see why.

What Happens on a Single Push — Step by Step

The YAML reads as "build and test," but behind the scenes, three actors — the CI platform (cloud), the self-hosted runner (local machine), and the dev board (USB-connected) — interact through multiple sequential stages. Here's what happens at each step and where logs are generated.

HIL CI sequence diagram — step-by-step interaction between CI platform, self-hosted runner, and development board during a single push event

The complete sequence from a single git push through build, flash, test, and verdict

Breaking it down:

Steps 1-3: Cloud. The developer pushes code. The CI platform reads the YAML, finds a matching runner, and dispatches the job. At this point, the code only exists in the cloud.

Steps 4-5: Runner build. The runner checks out the source and cross-compiles with west build. Build logs are generated here. If the build fails, it stops and the error log is uploaded to the cloud. In the split Docker architecture, this step runs on a cloud runner (amd64).

Steps 6-8: Physical interaction with the board. On a successful build, the runner uses nrfjprog to flash the firmware via USB/J-Link. The board resets, boots the new firmware, and outputs logs through the UART serial port. This log capture is the core of HIL — the runner opens the board's serial port (/dev/ttyACM0 or COM3 on Windows) and reads the output in real time.

Step 9: Verdict. Twister matches the captured serial log against regex patterns defined in testcase.yaml. If "Feature initialized successfully" appears within the timeout, it's a pass. Otherwise, fail.

Steps 10-11: Reporting. The runner uploads the verdict and log files to the cloud. The CI platform marks the PR with a check (pass or fail). On failure, serial logs are attached as artifacts for the developer to download and analyze.

Where logs are generated:

| Log Type | Generated At | Contents | What to Check on Failure |
| --- | --- | --- | --- |
| Build log | Runner (steps 4-5) | Compile warnings/errors, linker errors | Missing headers, Kconfig symbol errors, memory overflow |
| Flash log | Runner → Board (step 6) | nrfjprog output, J-Link connection status | USB recognition failure, J-Link firmware mismatch, board power issue |
| Serial log | Board → Runner (step 8) | Firmware boot messages, test output, Hard Fault dumps | Init failure, ISR context violation, stack overflow |
| Twister verdict log | Runner (step 9) | pass/fail results, timeout info | Pattern mismatch, timeout exceeded |
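Conceptually, the verdict step (step 9) reduces to pattern matching over the captured text. Here's a minimal stand-alone sketch — the log content is fabricated, and Twister's real implementation lives in its Python harness and does considerably more (timeouts, per-line processing), but the core idea is this:

```shell
# Fabricated serial capture, standing in for what the runner reads
# from /dev/ttyACM0 during the test window
cat > device.log <<'EOF'
[00:00:00.123] <inf> app: System starting
[00:00:00.456] <inf> app: Feature initialized successfully
EOF

# one_line-style verdict: pass if the pattern appears anywhere in the log
if grep -qE 'Feature initialized successfully' device.log; then
  VERDICT=pass
else
  VERDICT=fail
fi
echo "$VERDICT"
```

The value of the framework isn't the grep — it's the plumbing around it: flashing the right board, bounding the capture with a timeout, and reporting the verdict back to the CI platform.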

Reproducing the Build Environment with Docker

The most common CI failure is "it works on my PC but not in CI." The standard solution for Zephyr/NCS projects is Docker.

Nordic provides an official Docker image called nrf-docker on Docker Hub (nordicplayground/nrfconnect-sdk). It contains every dependency needed to run west commands — Zephyr SDK, Python venv, west manifest. You pull this image and use it as the build environment; you're not uploading your code to Docker Hub. It's the same idea as apt install for the compiler.

One caveat: this official image is amd64 (x86_64) only. A Raspberry Pi is ARM64 and can't run this image directly. So the CI pipeline splits into two stages:

Split CI pipeline — Docker build on amd64 cloud runner, flash and test on ARM64 Raspberry Pi self-hosted runner

CI pipeline splitting amd64 Docker build from ARM64 Raspberry Pi testing

How project files flow through each stage:

| Stage | Where | What | How |
| --- | --- | --- | --- |
| git checkout | Cloud/local | Full source code | CI auto-clones from Git repo |
| Docker pull | Cloud/local | Build tools (SDK, compiler) | Downloads Nordic official image from Docker Hub |
| west build | Inside Docker container | Source → zephyr.hex | ARM cross-compilation (ARM binary built on amd64 host) |
| Artifact transfer | CI platform | zephyr.hex (~hundreds of KB) | GitHub Actions artifact, GitLab job artifact, etc. |
| west flash | Raspberry Pi | zephyr.hex → board | nrfjprog flashes via USB/J-Link |
| Twister test | Raspberry Pi | Serial logs | Captures board UART output, pattern matches |

A single-stage architecture where the Raspberry Pi handles both build and flash without Docker is also viable. You'd install Zephyr SDK and west directly on the Pi. Build times are 3-5x slower than amd64, but the pipeline is simpler. I started with this single-stage setup since my project is small, and I'll switch to the split architecture if build time becomes a bottleneck.

The CI YAML for the split Docker architecture looks like this (GitHub Actions example):

# .github/workflows/hil-test.yml — split build/test architecture
name: HIL Test (Split)
on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest  # cloud runner (amd64)
    container:
      image: nordicplayground/nrfconnect-sdk:v2.9-branch
    steps:
      - uses: actions/checkout@v4
      - run: west init -l . && west update
      - run: west build -b nrf52dk/nrf52832
      - uses: actions/upload-artifact@v4
        with:
          name: firmware
          path: build/zephyr/zephyr.hex

  test:
    needs: build
    runs-on: self-hosted  # Raspberry Pi (ARM64)
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: firmware
      - name: Flash firmware
        run: nrfjprog --program zephyr.hex --chiperase --verify --reset
      - name: Run Twister tests
        run: west twister --device-testing --hardware-map hardware-map.yml -T tests/

I pinned my SDK version using the T2 topology's west.yml, so running west init and west update inside the Docker image reproduces the exact same environment as my dev PC. Accessing USB devices from inside a Docker container requires the --device flag, and its behavior varies subtly across platforms — which is another reason I chose the split architecture.
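For reference, passing a board's serial port into a Linux container typically looks like the command constructed below. This is a sketch only — the device path and working-directory layout are assumptions for illustration, and the snippet builds the command string rather than executing it (running it requires Docker and attached hardware):

```shell
# Sketch: map the board's serial port into the build container with
# --device. Device path and mount layout are illustrative assumptions.
BOARD_TTY=/dev/ttyACM0
DOCKER_CMD="docker run --rm -v $PWD:/workdir -w /workdir \
  --device=$BOARD_TTY \
  nordicplayground/nrfconnect-sdk:v2.9-branch \
  west twister --device-testing --hardware-map hardware-map.yml -T tests/"
echo "$DOCKER_CMD"
```

On Windows with the WSL2 backend, USB passthrough needs extra steps (e.g., usbipd), which is part of why I kept flashing outside the container in the split architecture.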

HIL CI Works Without T2 Topology Too

The example above assumes T2 topology (a west.yml manifest at the project root). But HIL CI itself doesn't require T2. All you need is "a buildable project" and "a board to flash."

The build method in CI varies by project structure:

| Project Structure | How to Build in CI | SDK Version Management |
| --- | --- | --- |
| T2 topology (west.yml present) | west init -l . && west update && west build | west.yml pins SDK revision — high reproducibility |
| Freestanding (local SDK folder, ZEPHYR_BASE env var) | export ZEPHYR_BASE=/path/to/sdk && west build | Pre-install SDK on runner, or clone a specific version in CI |
| nRF Connect SDK + VS Code extension (GUI-based build) | Build the same project via CLI: west build -b nrf52dk/nrf52832 | Pin SDK version via env var or Docker image tag |

The simplest way to put a freestanding project into CI is to pre-install the NCS SDK on the runner machine and set ZEPHYR_BASE:

# Freestanding project CI example (GitHub Actions)
jobs:
  hil-test:
    runs-on: self-hosted  # runner with pre-installed SDK
    env:
      ZEPHYR_BASE: /home/runner/ncs/v2.9.0/zephyr
    steps:
      - uses: actions/checkout@v4
      - run: west build -b nrf52dk/nrf52832
      - run: west twister --device-testing --hardware-map hardware-map.yml -T tests/

The downside: the SDK version is tied to the runner machine. Updating the runner's SDK affects every project. That's exactly why T2 topology uses west.yml to pin SDK versions independently per project. But if you have a single project and just want to get CI running, freestanding is enough. You can upgrade the structure later.

Precedent: Golioth's Implementation

The implementation I referenced most while designing this pipeline was Golioth's HIL case study. Golioth, an IoT platform company, runs exactly this architecture — Raspberry Pi + GitHub Actions self-hosted runner + nRF52840dk — to execute automated HIL tests on every PR.

Key design decisions from Golioth:

  • Record all connected devices in hardware-map.yml. Serial port, device ID, platform, and runner info are managed in YAML. When a board is added or swapped, only this file needs updating.
  • Pre-stage WiFi/cloud credentials on the runner locally. No secrets in the repository. Setup files live on the runner machine, and the workflow references them.
  • Auto-detect connected boards. They wrote a script that automatically recognizes USB-connected boards and generates the hardware-map.yml. Physically swapping a board is reflected on the next CI run.

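The auto-detect idea is simple enough to sketch. The generator below is hypothetical (Golioth's actual script differs): it scans a /dev/serial/by-id-style directory for SEGGER J-Link entries and emits hardware-map.yml entries. The platform field is hardcoded as an assumption; a by-id symlink path is itself a stable serial-port path, so it can be used directly as the serial: field:

```shell
# Hypothetical hardware-map generator. Scans a directory of serial
# by-id symlinks (real use: /dev/serial/by-id) for J-Link probes and
# prints one hardware-map entry per board.
gen_hardware_map() {
  dir=$1
  for link in "$dir"/usb-SEGGER_J-Link_*; do
    [ -e "$link" ] || continue
    # the J-Link serial number is embedded in the by-id file name
    sn=$(basename "$link" | sed -E 's/^usb-SEGGER_J-Link_([0-9]+)-.*$/\1/')
    printf -- '- connected: true\n  id: %s\n  platform: nrf52dk/nrf52832\n  product: J-Link\n  runner: nrfjprog\n  serial: %s\n  baud: 115200\n' \
      "$sn" "$link"
  done
}

# Dry run against a fake directory instead of real hardware
fake=$(mktemp -d)
touch "$fake/usb-SEGGER_J-Link_000683459357-if00"
MAP=$(gen_hardware_map "$fake")
echo "$MAP"
```

Regenerating the map on runner boot means a physically swapped board is picked up on the next CI run without touching the repo.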
I didn't adopt this structure wholesale. Golioth is a cloud service company, so they validate network connectivity, authentication, and OTA (Over-the-Air firmware update) via HIL. My immediate need was simpler: "flash after build, verify basic behavior via serial logs." Scope your automation to match your actual needs.


Twister + Real Hardware — Writing and Running Tests

hardware-map.yml and testcase.yaml

Twister's --device-testing mode operates on two YAML files.

hardware-map.yml — physical board info connected to the runner:

# hardware-map.yml
- connected: true
  id: 000683459357        # J-Link serial number
  platform: nrf52dk/nrf52832
  product: J-Link
  runner: nrfjprog         # flashing tool
  serial: /dev/ttyACM0     # serial port
  baud: 115200

Only boards with connected: true are included as test targets. The J-Link serial number (id) uniquely identifies each board, so multiple boards on the same runner don't conflict. Twister's hardware map currently supports the pyocd, nrfjprog, jlink, openocd, and dediprog runners. Other runners are still in progress.

testcase.yaml — test definition:

# tests/my_feature/testcase.yaml
tests:
  my_app.feature.basic:
    platform_allow:
      - nrf52dk/nrf52832
    harness: console
    harness_config:
      type: one_line
      regex:
        - "Feature initialized successfully"
        - "Self-test passed"
    tags:
      - feature
      - hil

harness: console finds regex patterns in serial output to determine pass/fail. If "Feature initialized successfully" appears in the log, it passes. If the pattern doesn't appear within the timeout, it fails. Simple — but it catches more than you'd expect.

Execution command:

west twister --device-testing \
  --hardware-map hardware-map.yml \
  -T tests/ \
  -vv  # verbose output

Twister automatically builds the firmware, flashes it to the board listed in hardware-map.yml, captures serial output, matches it against testcase.yaml conditions, and reports results. west flash internally calls nrfjprog, which uses the J-Link DLL. In headless environments, the process runs without firmware update dialogs.

What Serial Logs Can and Can't Catch

"So it just checks whether my predefined log messages appear?" Yes. And that catches more than you'd think.

When debugging via serial manually, there are two modes: watching logs scroll in real time and checking "this log should appear at this timing," or dumping logs to a file and searching for keywords later. Serial log verification in CI is closer to the latter — capture the entire log, then automatically check whether predefined patterns are present or absent.

Specific firmware scenarios this simple mechanism catches:

1. Boot initialization sequence verification

Firmware typically initializes subsystems in order at boot. BLE stack, then sensor driver, then application logic. Miss a Kconfig option, and a subsystem silently drops out. Manually, you might notice "the log looks shorter than usual" and move on. CI flags it immediately when the "BLE stack initialized" pattern is missing.

# Initialization sequence testcase
tests:
  boot.init_sequence:
    harness: console
    harness_config:
      type: multi_line
      ordered: true
      regex:
        - "\\[00:00:00.0\\d+\\] <inf> app: System starting"
        - "\\[00:00:00.\\d+\\] <inf> ble: BLE stack initialized"
        - "\\[00:00:00.\\d+\\] <inf> sensor: IMU ready"
        - "\\[00:00:01.\\d+\\] <inf> app: All subsystems up"

type: multi_line with ordered: true means the patterns must appear in this exact order. Out of order or missing one — fail. I caught an issue this way when the AI refactored code and inadvertently changed the initialization order.
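The gist of ordered matching can be reproduced in a few lines of awk — advance to the next expected pattern only once the current one has been seen, and pass only if every pattern matched in sequence. The log below is fabricated to mirror the testcase, and this is a sketch of the idea, not Twister's actual implementation:

```shell
# Fabricated boot log in the expected order
cat > boot.log <<'EOF'
[00:00:00.012] <inf> app: System starting
[00:00:00.340] <inf> ble: BLE stack initialized
[00:00:00.512] <inf> sensor: IMU ready
[00:00:01.002] <inf> app: All subsystems up
EOF

# Ordered matching: i points at the next pattern we still need to see;
# pass iff all four patterns matched in sequence
ORDERED=$(awk '
  BEGIN { p[1]="System starting"; p[2]="BLE stack initialized";
          p[3]="IMU ready"; p[4]="All subsystems up"; i=1 }
  i <= 4 && index($0, p[i]) { i++ }
  END { print (i > 4 ? "pass" : "fail") }' boot.log)
echo "$ORDERED"
```

Swap any two log lines in the sample and the verdict flips to fail — which is exactly what caught the AI's reordered initialization.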

2. Automatic Hard Fault detection

Calling k_sleep() in ISR context or dereferencing a null pointer triggers a Hard Fault on ARM Cortex-M. Zephyr's default Fault Handler dumps registers to serial:

[00:00:01.234] <err> os: ***** HARD FAULT *****
[00:00:01.234] <err> os:   Fault escalation (see below)
[00:00:01.235] <err> os: r0/a1:  0x00000000  r1/a2:  0x20001234
[00:00:01.235] <err> os: Current thread: 0x20000458 (main)

This is a pattern that must not appear. A Hard Fault halts the firmware before the success marker can be printed, so gating on the success pattern in testcase.yaml is enough — and Twister's harness additionally treats Zephyr's fatal-error banner in the output as a failure on its own:

# Fail when the success marker never appears — a Hard Fault guarantees it won't
tests:
  safety.no_hard_fault:
    harness: console
    harness_config:
      type: one_line
      regex:
        - "All self-tests passed"

If I were watching the serial monitor myself, I'd spot the Hard Fault dump immediately. But without CI, tracing "which of the 5 commits pushed over the weekend broke it" is painful. CI running this test on every commit tells you exactly which commit introduced the fault — no git bisect needed.
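The same must-not-appear check works as a one-liner for quick local triage of a downloaded device.log, independent of Twister. The sample capture is fabricated:

```shell
# Fabricated failure capture containing a fault dump
cat > device.log <<'EOF'
[00:00:01.234] <err> os: ***** HARD FAULT *****
[00:00:01.235] <err> os: Current thread: 0x20000458 (main)
EOF

# Inverse of the success check: presence of the banner is an immediate fail
if grep -q 'HARD FAULT' device.log; then
  FAULT_VERDICT=fail
else
  FAULT_VERDICT=pass
fi
echo "$FAULT_VERDICT"
```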

3. Timing-based verification

Zephyr logs include timestamps. This lets you verify timing requirements like "BLE advertising must start within 3 seconds of boot":

# Verify advertising starts within 3 seconds of boot
tests:
  ble.adv_start_timing:
    harness: console
    harness_config:
      type: one_line
      regex:
        - "\\[00:00:0[0-2]\\.\\d+\\] <inf> ble: Advertising started"
    timeout: 5

The timestamp portion of the regex, 00:00:0[0-2], only matches the first three seconds of uptime (0.000 through 2.999). If advertising starts at the 3-second mark or later, the pattern never matches, and the test times out as a failure.
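The window check is easy to sanity-test against fabricated log lines before committing the pattern — one stamped inside the window, one outside:

```shell
# Pass iff "Advertising started" is stamped before the 3-second mark
check_adv_timing() {
  grep -qE '\[00:00:0[0-2]\.[0-9]+\] .*Advertising started' "$1" \
    && echo pass || echo fail
}

printf '[00:00:02.418] <inf> ble: Advertising started\n' > fast.log
printf '[00:00:03.207] <inf> ble: Advertising started\n' > slow.log
FAST=$(check_adv_timing fast.log)   # inside the window
SLOW=$(check_adv_timing slow.log)   # too late: timestamp never matches
```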

4. Memory usage regression detection

Enabling Zephyr's CONFIG_THREAD_ANALYZER periodically logs each thread's stack usage:

[00:00:05.000] <inf> thread_analyzer:  main    : STACK: unused 512 usage 1536 / 2048 (75 %); CPU: 12 %
[00:00:05.000] <inf> thread_analyzer:  ble_rx  : STACK: unused 128 usage 896 / 1024 (87 %); CPU: 3 %

"unused 128" means only 128 bytes of stack headroom remain. You can pattern-match this and fail when headroom drops below a threshold — catching stack growth early as the AI adds code.

What this approach can't catch

Serial log pattern matching only verifies "logs I predicted in advance." Unexpected failures — BLE disconnecting after 30 minutes, sensor values drifting at certain temperatures — won't be caught unless you build tests that reproduce those specific conditions.

Real-time interactive debugging is also outside CI's scope. "Watch serial output while pressing a button at a specific moment" is still a desk job. CI's role is "automatically re-verify known correct behavior on every commit," not "discover new problems." When you do discover a new problem, you write a test for it and add it to CI — that's how test suites naturally grow thicker over time.

Automatable Tests vs. Non-automatable Tests

Not everything can be automated with HIL. Drawing the boundary clearly matters.

Automatable:

  • UART/RTT log output verification (string pattern matching)
  • State machine transition checks (log state changes, verify sequence)
  • Boot time measurement (timestamp-based)
  • I2C/SPI device response checks (when sensors are physically connected)
  • Memory usage reports (parsing the .map file generated at build time)

Difficult or impossible to automate:

  • BLE RF performance (RSSI, packet error rate) — requires dedicated test equipment
  • Analog sensor accuracy — requires a reference input source
  • Power consumption measurement — requires a current probe (Zephyr 4.2 added a power measurement harness to Twister, but it needs physical measurement hardware)
  • Long-duration stress tests — hits CI execution time limits
  • UI/display output — camera-based verification is possible but complex (Zephyr 4.3 added visual fingerprint matching)

I focused on the "automatable" list. The most common failure patterns in AI-generated code — boot initialization failures, features silently disabled by wrong Kconfig, Hard Faults from ISR context violations — are all catchable via serial logs. Aiming for perfection means never starting. "Automatically catching 80% of the most common failures" is the realistic goal.


Plugging CI into the AI Workflow — Closing the Loop

Research, Plan, Execute, Test, CI: The Final Workflow

Adding CI to the four-stage loop from series #4 produces this workflow:

Complete AI firmware development workflow — Research, Plan, Execute, PR, CI Pipeline with HIL testing, and AI feedback loop on failure

HIL CI joins the four-stage AI loop, forming a closed feedback loop

Creating a PR (Pull Request) triggers CI automatically. A build failure surfaces the build log; a test failure surfaces the serial log. Standard CI so far.

Feeding CI Failure Logs Back to the AI

The differentiator is the feedback loop on failure. When CI fails, I pass the serial logs to the AI for root cause analysis and fix suggestions.

A finding from series #4: "AI's accuracy is highest when analyzing logs." Logs are factual data, which leaves little room for hallucination. The same applies to CI-captured serial logs. Hand the AI a Hard Fault register dump, stack trace, and error codes, and it provides reasonably accurate analysis: "this address corresponds to this function at this offset, and the probable cause is X."

# Workflow example: save logs on CI failure (GitHub Actions)
- name: Save failure logs
  if: failure()
  run: |
    mkdir -p artifacts
    cp twister-out/*/handler.log artifacts/
    cp twister-out/*/device.log artifacts/

- name: Upload artifacts
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: failure-logs
    path: artifacts/

Feeding the saved logs to Claude Code:

# Request AI analysis of failure logs locally
claude "This Twister test failed in CI. Analyze device.log." \
  @artifacts/device.log

This loop isn't fully automated yet. There's manual intervention between CI failure, log download, and handing it to the AI. Tools like GitLab Duo Root Cause Analysis are narrowing this gap, but no production tool yet auto-analyzes embedded firmware serial logs. [TBD: Need to add concrete experience of the CI failure → AI analysis → fix application cycle]

Reusing Skills and Hooks in CI

The Kconfig validation hook from series #3 — a script that greps build/zephyr/.config and Kconfig sources to catch nonexistent symbols when a .conf file is modified — also works in CI.

The approach is straightforward. Include the hook script in the repo and run it before the build step in the CI workflow:

# Run Kconfig validation hook in CI
- name: Validate Kconfig
  run: |
    west build -b nrf52dk/nrf52832
    ./scripts/validate_kconfig.sh prj.conf build/zephyr/.config

Claude Code's skill fires when the AI modifies a .conf file; the CI validation catches it when a human edits .conf manually too. The same validation logic, running at two points. Tools created during AI collaboration naturally extending into CI infrastructure — that's the compounding effect of the pipeline built across this series.
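For readers who didn't follow series #3: the validation logic is roughly this. The function below is a hypothetical sketch of what such a script might look like (the real hook differs) — every symbol assigned in prj.conf should surface in the generated .config, either set or explicitly recorded as "# CONFIG_FOO is not set"; anything else was likely a hallucinated or misspelled symbol:

```shell
# Hypothetical sketch of the Kconfig validation idea
validate_kconfig() {
  conf=$1 dotconfig=$2 rc=0
  while IFS= read -r line; do
    # only look at symbol assignments like CONFIG_FOO=y
    case $line in CONFIG_*=*) sym=${line%%=*} ;; *) continue ;; esac
    if ! grep -Eq "^$sym=|^# $sym is not set" "$dotconfig"; then
      echo "unknown or dropped symbol: $sym"
      rc=1
    fi
  done < "$conf"
  return $rc
}

# Dry run with fabricated files
printf 'CONFIG_GPIO=y\nCONFIG_TOTALLY_FAKE=y\n' > prj.conf.sample
printf 'CONFIG_GPIO=y\n' > dotconfig.sample
OUT=$(validate_kconfig prj.conf.sample dotconfig.sample) || true
echo "$OUT"
```

A non-zero exit fails the CI step, and the message names the offending symbol — the same signal the Claude Code hook surfaces locally.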


Remaining Gaps and Next Steps

What HIL CI Still Can't Catch

I need to be honest. Adding HIL CI doesn't mean every hardware problem is automatically caught:

  • RF performance: BLE connection stability, RSSI, and packet error rate require measurement equipment (sniffer, spectrum analyzer). Serial logs only tell you "connection succeeded/failed," not "why it failed."
  • Long-term stability: Memory leaks and stack overflows only surface after hours or days of operation. CI workflows typically run for minutes to tens of minutes — too short to catch these.
  • Power consumption: Current profiles of sleep/wake cycles can't be measured without a current probe. Zephyr 4.2 added a power measurement harness to Twister, but it requires physical measurement hardware on the runner.
  • Multi-device interaction: BLE Central-Peripheral communication and mesh network behavior require controlling multiple boards simultaneously. Possible, but setup complexity escalates sharply.

Just as I noted in series #4 that "security-related code (encryption, Secure Boot, OTA signing) stays manually written," HIL CI also requires consciously defining the boundary between "what to automate" and "what a human verifies."

The Real Costs

Maintaining a HIL CI pipeline has costs. I won't sugarcoat them.

Minimum hardware:

  • Raspberry Pi 4 (~$55) + SD card + power adapter
  • nRF52 DK ($40) + USB cable
  • Total: ~$100 (one-time)

Hidden operational costs:

  • OS updates, security patches — neglect these and you have a security hole
  • SD card lifespan — heavy writes mean replacement every 1-2 years
  • USB connection instability — the board occasionally drops off and requires a physical reconnect
  • GitHub was expected to introduce a $0.002/min platform fee for self-hosted runners on private repos starting March 2026, but community pushback led to an indefinite postponement. Worth watching for future changes

For individuals or small teams running fewer than 1,000 builds per month, the cloud hosting cost savings are negligible. But if you've ever lost half a day to "the build passed but the board doesn't work," the $100 upfront investment pays for itself. Measure the value not in dollars, but in time and trust.

Reflecting on Five Posts

This post wraps up the technical content of the series. Here's the pipeline built across all five posts at a glance:

| # | Post | Pipeline Layer |
| --- | --- | --- |
| 1 | Antigravity IDE | Development environment |
| 2 | NCS T2 Topology | Project structure |
| 3 | Claude Code Skills + Hooks | AI tooling |
| 4 | Research → Plan → Execute → Test Loop | AI workflow |
| 5 | HIL CI (this post) | Automated verification |

Environment, structure, tooling, methodology, verification. Each layer stands on the one below it. The IDE isolates projects via T2 topology. Claude Code skills and hooks catch AI hallucinations on that foundation. The four-stage loop structures the workflow. And HIL CI verifies it all on real hardware.

I know this setup isn't perfect. But going from "I tried having AI write firmware and it didn't work" to "a repeatable process for building firmware with AI" — that's real progress.

The next post will look back at the entire five-post journey and distill what I learned at the intersection of AI and embedded firmware development — what worked, and what remains firmly in the human domain.
