55.6%.
That's DeepSeek-R1's pass@1 on EmbedBench when it gets a circuit schematic alongside the task description. 50.0% without the schematic. Best score from the best reasoning model on the first comprehensive benchmark for LLMs in embedded systems development. Cross-platform migration to ESP-IDF tops out at 29.4%, set by Claude 3.7 Sonnet (Thinking).
Take a second with that. The same models that one-shot a Next.js app are coin-flipping firmware. And the benchmark only tested three boards. PlatformIO's catalog covers 1,553.
That 1,553 number is the live count from pio boards --json-output against PlatformIO Core 6.1.18 on the day this post was written, and PlatformIO-MCP wraps that catalog directly. So when we say "1,553 boards," we mean an MCP server you can npx-install today that knows how to build, flash, and monitor against any of them.
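Here's a minimal sketch of that count, parsing the JSON array that `pio boards --json-output` emits. The three inline records are illustrative samples, and the field names (`id`, `name`, `platform`, `mcu`) are an assumption modeled on PlatformIO's board manifests, so treat the exact schema as something to verify against your own `pio` install.

```python
import json

# Illustrative stand-in for `pio boards --json-output` stdout: a JSON array
# of board records. Field names are assumed from PlatformIO board manifests.
sample = json.loads("""
[
  {"id": "uno", "name": "Arduino Uno", "platform": "atmelavr", "mcu": "ATMEGA328P"},
  {"id": "esp32dev", "name": "Espressif ESP32 Dev Module", "platform": "espressif32", "mcu": "ESP32"},
  {"id": "pico", "name": "Raspberry Pi Pico", "platform": "raspberrypi", "mcu": "RP2040"}
]
""")

# The headline number is just the length of this array; grouping by platform
# shows how the catalog spans vendors, not just boards.
by_platform = {}
for board in sample:
    by_platform[board["platform"]] = by_platform.get(board["platform"], 0) + 1

print(len(sample), by_platform)
```

Against a real install you'd pipe the live output in instead of the inline sample; the counting logic is the same.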
What EmbedBench actually measures
EmbedAgent (Wang et al., 2025) is the paper. EmbedBench is the benchmark. 126 cases, nine electronic components, three hardware platforms -- Arduino Uno, ESP32, Raspberry Pi Pico. The authors evaluate LLMs across three roles a real embedded engineer plays: Programmer (write code given a schematic), Architect (design the circuit and the code from a task description), Integrator (port working code from one platform to another). Scoring is pass@1 against the Wokwi virtual circuit simulator. Either the test passes or it doesn't.
The methodology is sound: the simulator is deterministic, the harness is automated, and pass@1 leaves no wiggle room for graded partial credit. So while the numbers are real, they're also incomplete.
What the harness can't see
EmbedBench is single-shot in a simulator. The model gets the task once and writes the answer once. There is no compiler error fed back, no error: 'GPIO_NUM_45' was not declared in this scope for the model to read and react to. There is no flash to a real board and no serial monitor to confirm the LED actually blinked at the rate it was supposed to. The Architect role can hand back a circuit with the wrong pin mapping and never find out until the test fails.
That isn't how anyone writes firmware. Embedded development is iterative: you compile, you read the toolchain noise, you fix the missing include, you flash, you watch the serial log, you change the baud rate, you fix the off-by-one in the timer ISR, you flash again. The benchmark measures the cold-start guess and reports it as if it were the whole loop.
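That loop is scriptable. `pio run` and `pio run --target upload` are real PlatformIO commands; the `run_step` helper, the retry budget, and the loop structure below are an illustrative sketch of how an agent would drive them, not any particular tool's implementation.

```python
import subprocess

def run_step(args):
    """Run one toolchain step, hand back (exit_code, combined output)."""
    proc = subprocess.run(args, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def firmware_loop(max_attempts=3):
    """Sketch of compile -> flash with the toolchain noise fed back each round."""
    for attempt in range(1, max_attempts + 1):
        code, log = run_step(["pio", "run"])  # compile
        if code != 0:
            # In a real agent, `log` (the compiler errors) goes back to the
            # model here so it can fix the missing include or bad pin name.
            continue
        code, log = run_step(["pio", "run", "--target", "upload"])  # flash
        if code == 0:
            return attempt  # `pio device monitor` would confirm behavior next
    return None
```

The point isn't the plumbing; it's that every line of compiler output becomes signal the model never sees in a single-shot benchmark.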
The paper's own failure modes back this up. LLMs flunked 7-segment displays because they got voltage levels and segment mappings wrong. They flunked push buttons because they didn't handle debounce. They flunked ESP-IDF migration because they hallucinated syntax for a framework they've barely seen in training. Every one of those failures is the kind of thing a build, a flash, and a serial print would catch on the second or third iteration.
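Take the push-button case. Debounce is a few lines once you've seen a noisy trace on the serial monitor: accept a state change only after it's been stable for several consecutive samples. This is an illustrative sketch of that logic (names and the stability threshold are mine, not the paper's), written as a pure function over a stream of raw reads so it's easy to test:

```python
def debounce(samples, stable_needed=3):
    """Collapse a noisy stream of raw button reads into debounced transitions.

    A change from the current state is accepted only after `stable_needed`
    consecutive samples agree on the new value; lone glitches are dropped.
    """
    debounced, run, last = [], 0, None
    current = samples[0]
    for s in samples:
        if s == current:
            run = 0            # back at the accepted state, reset the streak
        else:
            run = run + 1 if s == last else 1
            if run >= stable_needed:
                current, run = s, 0
                debounced.append(current)  # stable long enough: real press
        last = s
    return debounced
```

Feed it `[0, 1, 0, 1, 1, 1, 0, 0, 0, 0]` and the two lone glitches vanish; only the sustained press and release survive. On a board the samples come from a GPIO read in a timer tick, but the logic is the same, and it's exactly the kind of fix a second look at the serial log prompts.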
What changes with a real loop
This is the part of the story PlatformIO-MCP fills in. The MCP server gives an agent the four tools that make iteration possible: it can build a project, push the binary onto a board, watch the serial line, and ask which boards are even connected. None of those tools fixes the underlying knowledge gap in ESP-IDF or 7-segment voltage tables; what they fix is the absence of a feedback signal. A model that can ship on the third try beats a model that has to nail it on the first, every time. The loop is the difference.
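Wired together, the four capabilities look something like this. The build/upload/monitor/list-devices split comes from the post; the tool names, the `call_tool` callable, the `"PASS"` serial marker, and the result shape are all hypothetical stand-ins for whatever your MCP client actually exposes:

```python
def iterate_until_green(call_tool, fix_code, max_attempts=3):
    """Sketch of an agent loop over hypothetical MCP tools.

    `call_tool(name)` is assumed to return {"ok": bool, "log": str};
    `fix_code(log)` stands in for handing the log back to the model.
    """
    for attempt in range(1, max_attempts + 1):
        result = call_tool("build")
        if not result["ok"]:
            fix_code(result["log"])        # compiler errors become signal
            continue
        call_tool("upload")                # push the binary to the board
        log = call_tool("monitor")["log"]  # read the serial line
        if "PASS" in log:                  # assumed success marker in output
            return attempt
        fix_code(log)                      # runtime behavior becomes signal too
    return None
```

Under this sketch, a model that fails the cold-start build but fixes itself on attempt two still returns green; the single-shot harness would have scored the same model zero.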
A note for the secure-customer crowd
The teams writing firmware for these platforms tend to be the same teams with stringent security requirements: defense, aerospace, medical, industrial. MCP plus local inference plus an open source agent plus the on-prem PlatformIO toolchain is a stack that clears the bar for these environments. The benchmark numbers and the deployment story end up being the same conversation.
Try the loop
npx pio-mcp dashboard
platformio-mcp is on npm, repo at github.com/jl-codes/platformio-mcp. One command, real boards, the whole feedback loop the benchmark didn't measure. If you've run an agent against an ESP32 or an STM32 and have data to share, DM me on X.