
chinmay-02

I Built Sentry.io for Microcontrollers — Here's How the Crash Capture Actually Works

Dashboard: firmwaresentry.io
SDK: github.com/chinmay-02/firmware-sentry-sdk


Eight years into firmware engineering (product companies, IoT startups, contract work), and the debugging story never changed. Device crashes in the field. Customer calls. You ask them to reproduce it. They can't. You squint at a log file with no context. You ship a guess.

The worst part: the information was there. The CPU captured it. The fault registers had the exact address, the exact cause, the exact task. It just didn't survive the reboot.

That's the problem I set out to fix. The result is Firmware Sentry — an open-source C SDK that captures the full fault state on crash, persists it across reset, sends it to the cloud, and gets Claude AI to diagnose it for you. Like Sentry.io, but for microcontrollers.

This post is about the hardest part: getting crash data to survive a reboot and reliably transmit back to the cloud.


The Core Problem: Reboot Destroys Evidence

When an ESP32 panics, here's the execution chain:

  1. CPU takes an exception
  2. esp_panic_handler() is called
  3. It prints a backtrace to UART
  4. Software reset (SW_CPU_RESET)

On reset, the startup code zero-fills .bss and re-initializes .data, and the heap and task stacks are rebuilt from scratch. By the time your app_main() runs again, the registers are gone, the stack is overwritten, the task name is lost.

The memory Firmware Sentry uses to carry data across a software reset on ESP32 is RTC slow memory, specifically RTC_NOINIT_ATTR variables, which the startup code explicitly skips during initialization.

// These survive SW_CPU_RESET. They do NOT survive power-on reset.
static RTC_NOINIT_ATTR uint32_t s_crash_magic;
static RTC_NOINIT_ATTR uint32_t s_crash_pc;
static RTC_NOINIT_ATTR uint32_t s_crash_cause;
static RTC_NOINIT_ATTR uint32_t s_crash_sp;
static RTC_NOINIT_ATTR uint32_t s_crash_a0;
static RTC_NOINIT_ATTR uint32_t s_crash_vaddr;
static RTC_NOINIT_ATTR uint32_t s_crash_uptime;
static RTC_NOINIT_ATTR uint32_t s_crash_task_hwm;
static RTC_NOINIT_ATTR char     s_crash_task[32];

// Breadcrumb ring buffer — also survives reset
static RTC_NOINIT_ATTR fs_event_t s_rtc_events[10];
static RTC_NOINIT_ATTR int        s_rtc_event_head;
static RTC_NOINIT_ATTR int        s_rtc_event_count;

// Total RTC usage: ~601 bytes out of 8KB available

The magic number check is critical. On power-on reset, RTC memory contains random garbage. You need a sentinel value to know whether the data you're reading is a real crash or uninitialized noise:

#define FS_CRASH_MAGIC 0xDEADC0DE

bool fs_hal_esp32_has_pending_crash(void) {
    return s_crash_magic == FS_CRASH_MAGIC;
}

On power-on, the magic won't match. On software reset after a crash, it will. This single check is the gate between "real crash data" and "random bytes."
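A magic number alone cannot detect a record that was only partly written before the reset. One possible hardening, sketched here under assumptions, is to group the fields into one struct, add a checksum, and write the magic last. The names fs_crash_record_t and fs_record_* are hypothetical and not part of the SDK; on an ESP32 the record instance would carry RTC_NOINIT_ATTR, while this version takes a pointer so it also compiles on a host.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FS_CRASH_MAGIC 0xDEADC0DEu

/* Hypothetical record layout (not the SDK's actual layout). */
typedef struct {
    uint32_t magic;
    uint32_t pc;
    uint32_t cause;
    uint32_t sp;
    uint32_t checksum;   /* covers pc, cause and sp */
} fs_crash_record_t;

/* Simple rotate-and-XOR mix. A CRC32 would be a stronger choice. */
static uint32_t fs_record_checksum(const fs_crash_record_t *r)
{
    const uint32_t words[3] = { r->pc, r->cause, r->sp };
    uint32_t sum = 0x9E3779B9u;
    for (size_t i = 0; i < 3; i++) {
        sum = ((sum << 5) | (sum >> 27)) ^ words[i];
    }
    return sum;
}

void fs_record_commit(fs_crash_record_t *r, uint32_t pc, uint32_t cause, uint32_t sp)
{
    r->magic = 0;                       /* invalid while fields are in flux */
    r->pc = pc;
    r->cause = cause;
    r->sp = sp;
    r->checksum = fs_record_checksum(r);
    r->magic = FS_CRASH_MAGIC;          /* written last: marks the record complete */
}

bool fs_record_is_valid(const fs_crash_record_t *r)
{
    return r->magic == FS_CRASH_MAGIC &&
           r->checksum == fs_record_checksum(r);
}

void fs_record_clear(fs_crash_record_t *r)
{
    r->magic = 0;
}
```

On real hardware the record would need to be accessed through a volatile pointer (or with a compiler barrier before the final store) so the compiler cannot reorder the magic write ahead of the other fields.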


Intercepting the Panic Handler

The standard ARM approach, overriding the weak HardFault_Handler symbol, has no direct equivalent on Xtensa: ESP-IDF's panic handler is a regular C function. But we can use the GNU linker's --wrap option to intercept it:

# In your CMakeLists.txt
target_link_options(${COMPONENT_TARGET} INTERFACE
    -Wl,--wrap=esp_panic_handler
)
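
With --wrap in place, the linker routes every call to esp_panic_handler through __wrap_esp_panic_handler, and __real_esp_panic_handler refers to the original:

#include "esp_private/panic_internal.h"  // panic_info_t

// Resolved by the linker to ESP-IDF's original handler.
extern void __real_esp_panic_handler(panic_info_t *info);

// Our interceptor. Called before ESP-IDF's handler.
void __wrap_esp_panic_handler(panic_info_t *info)
{
    // Capture everything while we still can
    fs_hal_esp32_capture_crash(info);

    // Let ESP-IDF print its backtrace and reboot normally
    __real_esp_panic_handler(info);
}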
Enter fullscreen mode Exit fullscreen mode

This is clean — we don't replace ESP-IDF's handler, we wrap it. The original handler still runs, prints its backtrace to UART, and performs the reset. We just intercept it first.


Extracting Registers from panic_info_t

This is where it gets platform-specific. The panic_info_t struct contains a void *frame pointer, but what that frame actually is depends on the CPU architecture.

On ESP32 (Xtensa LX6):

static void fs_hal_esp32_capture_crash(panic_info_t *info)
{
    if (!info || !info->frame) return;

    // Get PC from the panic info — this matches exactly what
    // IDF's monitor tool reports
    s_crash_pc = (uint32_t)panic_get_address(info->frame);

    // Cast to Xtensa exception frame for the rest
    XtExcFrame *frame = (XtExcFrame *)info->frame;
    s_crash_sp     = frame->a1;   // A1 is the stack pointer on Xtensa
    s_crash_a0     = frame->a0;   // A0 is the return address / LR equivalent
    s_crash_cause  = frame->exccause;
    s_crash_vaddr  = frame->excvaddr;  // The bad memory address on a load/store fault

    // Capture runtime context
    s_crash_uptime = (uint32_t)(esp_timer_get_time() / 1000);

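    // Get the name of the FreeRTOS task that crashed.
    // s_crash_task is NOINIT, so terminate it explicitly on every path.
    s_crash_task[0] = '\0';
    s_crash_task_hwm = 0;
    TaskHandle_t task = xTaskGetCurrentTaskHandle();
    if (task) {
        const char *name = pcTaskGetName(task);
        strncpy(s_crash_task, name, sizeof(s_crash_task) - 1);
        s_crash_task[sizeof(s_crash_task) - 1] = '\0';
        s_crash_task_hwm = uxTaskGetStackHighWaterMark(task);
    }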

    // Set the magic — now this data is valid
    s_crash_magic = FS_CRASH_MAGIC;
}

The key insight here: on Xtensa the stack pointer is frame->a1, and a0 holds the return address. On ARM Cortex-M the stack pointer is R13 and the return address lives in LR (R14). The HAL layer exists to abstract exactly this difference: the same fs_init() call works on both architectures because each platform's HAL knows its own register layout.
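To make the abstraction concrete, here is a minimal sketch of how two platform HALs could map their own frames onto one architecture-neutral struct. Everything here is an assumption for illustration: fs_regs_t, the mock frame types, and the fs_regs_from_* functions are hypothetical, and the mock Xtensa frame carries only the three fields used rather than the full XtExcFrame.

```c
#include <stdint.h>

/* Architecture-neutral view the rest of an SDK could consume (hypothetical). */
typedef struct {
    uint32_t pc;
    uint32_t sp;
    uint32_t lr;
} fs_regs_t;

/* Stand-in for the few XtExcFrame fields used here. */
typedef struct {
    uint32_t pc;
    uint32_t a0;   /* return address */
    uint32_t a1;   /* stack pointer */
} mock_xtensa_frame_t;

/* Cortex-M exception entry stacks these eight words in this order. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} mock_cortexm_frame_t;

fs_regs_t fs_regs_from_xtensa(const mock_xtensa_frame_t *f)
{
    fs_regs_t r = { .pc = f->pc, .sp = f->a1, .lr = f->a0 };
    return r;
}

/* frame_addr is the address where the hardware stacked the frame. */
fs_regs_t fs_regs_from_cortexm(const mock_cortexm_frame_t *f, uint32_t frame_addr)
{
    /* The pre-fault SP sits just above the 32-byte basic frame. */
    uint32_t sp = frame_addr + (uint32_t)sizeof(*f);
    /* xPSR bit 9 set means the hardware inserted 4 bytes of alignment padding. */
    if (f->xpsr & (1u << 9)) {
        sp += 4u;
    }
    fs_regs_t r = { .pc = f->pc, .sp = sp, .lr = f->lr };
    return r;
}
```

The Cortex-M variant ignores the extended floating-point frame, which a real handler on an M4F would have to account for.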


The Breadcrumb System

Registers tell you where it crashed. Breadcrumbs tell you what it was doing before it crashed.

The concept is simple: your firmware logs short string events to a ring buffer in RTC memory throughout its operation. When it crashes, those events survive the reset and get included in the crash report.

// Scatter these throughout your application code
fs_log_event("boot");
fs_log_event("nvs_init_ok");
fs_log_event("wifi_connected");
fs_log_event("mqtt_sub_ok");
fs_log_event("ota_check_start");
fs_log_event("ota_dl_begin");
// ...crash occurs here

// Cloud dashboard shows:
// boot → nvs_init_ok → wifi_connected → mqtt_sub_ok → ota_check_start → ota_dl_begin ← crash

The ring buffer implementation uses integer head/count indices (also in RTC NOINIT) and overwrites the oldest entry when full:

void fs_log_event(const char *event) {
    if (!event || !*event) return;

    int idx = (s_rtc_event_head + s_rtc_event_count) % FS_MAX_EVENTS;

    strncpy(s_rtc_events[idx].event, event, FS_MAX_EVENT_LEN - 1);
    s_rtc_events[idx].event[FS_MAX_EVENT_LEN - 1] = '\0';
    s_rtc_events[idx].ts = (uint32_t)(esp_timer_get_time() / 1000);

    if (s_rtc_event_count < FS_MAX_EVENTS) {
        s_rtc_event_count++;
    } else {
        // Ring is full — overwrite oldest
        s_rtc_event_head = (s_rtc_event_head + 1) % FS_MAX_EVENTS;
    }
}

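No malloc. No mutex. No RTOS dependency. It can be called from any context, including interrupt handlers, with one caveat: the update is not atomic, so a call that preempts another (an ISR interrupting a task, or two cores logging at once) can clobber an entry. For a best-effort breadcrumb trail, that trade-off keeps the code as simple as possible.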
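Because the head and count indices live in NOINIT memory too, they hold garbage after a power-on reset, and a garbage index would send the modulo arithmetic out of bounds. The sketch below shows one way to guard against that at boot and to read events back oldest-first. The fs_ring_t wrapper, fs_ring_sanitize, fs_ring_get, and the FS_MAX_EVENT_LEN value are assumptions for illustration, not the SDK's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define FS_MAX_EVENTS    10
#define FS_MAX_EVENT_LEN 24   /* assumed value; the SDK's may differ */

typedef struct {
    char     event[FS_MAX_EVENT_LEN];
    uint32_t ts;
} fs_event_t;

/* Hypothetical grouping of the ring state shown above. */
typedef struct {
    fs_event_t events[FS_MAX_EVENTS];
    int        head;
    int        count;
} fs_ring_t;

/* Resets the ring if its indices are out of range.
 * Returns 1 if a reset happened, 0 if the indices were plausible. */
int fs_ring_sanitize(fs_ring_t *ring)
{
    if (ring->head < 0 || ring->head >= FS_MAX_EVENTS ||
        ring->count < 0 || ring->count > FS_MAX_EVENTS) {
        ring->head = 0;
        ring->count = 0;
        return 1;
    }
    return 0;
}

/* Returns the i-th oldest event, or NULL if i is out of range. */
const fs_event_t *fs_ring_get(const fs_ring_t *ring, int i)
{
    if (i < 0 || i >= ring->count) {
        return NULL;
    }
    return &ring->events[(ring->head + i) % FS_MAX_EVENTS];
}
```

A range check only catches implausible indices. Garbage that happens to fall in range still looks valid, so in practice the ring would also be reset whenever the crash magic does not match.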


The Build Hash: Linking Crash to Binary

One of the subtle problems in crash reporting is knowing which firmware build a crash came from. Firmware version strings are developer-set and often wrong ("I forgot to bump the version"). Build hashes are automatic.

On ESP32, the IDF build embeds a SHA256 hash of the firmware ELF file in the app descriptor. We capture its first 8 hex characters at startup with a constructor attribute:

#include "esp_app_desc.h"

static char s_build_hash[9];  // 8 hex chars + null

__attribute__((constructor))
static void capture_build_hash(void) {
    // esp_app_get_elf_sha256() writes the hash as a hex string and
    // truncates it to fit the buffer: 9 bytes gives 8 hex chars + NUL.
    esp_app_get_elf_sha256(s_build_hash, sizeof(s_build_hash));
}

The __attribute__((constructor)) fires before app_main(). This means the hash is captured before any application code runs — even before the RTOS scheduler starts. The first 8 hex characters (32 bits of SHA256) give you enough uniqueness to identify a build without sending the full hash.

This hash goes into every crash report. When you upload your ELF file to the cloud dashboard, it's indexed by this hash. Symbol resolution (an addr2line-style lookup over the ELF's DWARF data) links the crash's PC address to the exact function name, file, and line number in that specific build.
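How much uniqueness do 32 bits actually buy? A rough birthday-bound estimate, sketched below, gives the chance that any two builds in a project share the same truncated hash. The function name is hypothetical and the approximation only holds while the result is small. For 1,000 builds it comes out to about 0.012%, or roughly 1 in 8,600.

```c
#include <stdint.h>

/* Birthday-bound approximation: p ~= n(n-1)/2 divided by 2^bits.
 * Only meaningful while the result is much smaller than 1. */
double fs_hash_collision_estimate(uint32_t builds, unsigned bits)
{
    double pairs = (double)builds * ((double)builds - 1.0) / 2.0;
    double space = 1.0;
    for (unsigned i = 0; i < bits; i++) {
        space *= 2.0;
    }
    return pairs / space;
}
```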


Sending the Crash

On boot, fs_init() checks the magic number. If there's a pending crash, it bundles everything into a JSON payload and POSTs it over HTTPS:

esp_err_t fs_init(void) {
    // ... wifi must be up before calling this ...

    if (fs_hal_has_pending_crash()) {
        ESP_LOGI(TAG, "Pending crash detected — sending to cloud");

        char payload[2048];
        fs_build_payload(payload, sizeof(payload));

        esp_err_t err = fs_https_post(FS_ENDPOINT_CRASHES, payload);
        if (err == ESP_OK) {
            fs_hal_clear_crash();  // Clear magic so we don't send again
            ESP_LOGI(TAG, "Crash report sent successfully");
        }
    }

    return ESP_OK;
}

The payload includes every captured register, the breadcrumb trail, task name, stack high watermark, uptime at crash, build hash, and firmware version. The cloud endpoint receives it, stores it in Postgres, and immediately triggers AI diagnosis in the background.
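As an illustration of the bundling step, here is a minimal sketch of a payload builder that formats a few fields into a fixed buffer with snprintf and reports truncation instead of sending a cut-off document. The fs_report_t struct, the function name, and the JSON field names are all assumptions; the SDK's real fs_build_payload and wire format may differ.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical subset of the captured crash state. */
typedef struct {
    uint32_t    pc;
    uint32_t    sp;
    uint32_t    cause;
    uint32_t    vaddr;
    uint32_t    uptime_ms;
    uint32_t    task_hwm;
    const char *task;        /* assumed plain ASCII, no quotes or backslashes */
    const char *build_hash;
} fs_report_t;

/* Returns the number of bytes written (excluding the NUL),
 * or -1 if the buffer is too small to hold the whole payload. */
int fs_build_payload_sketch(char *out, size_t out_len, const fs_report_t *r)
{
    int n = snprintf(out, out_len,
        "{\"pc\":\"0x%08" PRIx32 "\",\"sp\":\"0x%08" PRIx32 "\","
        "\"cause\":%" PRIu32 ",\"vaddr\":\"0x%08" PRIx32 "\","
        "\"uptime_ms\":%" PRIu32 ",\"task_hwm\":%" PRIu32 ","
        "\"task\":\"%s\",\"build\":\"%s\"}",
        r->pc, r->sp, r->cause, r->vaddr,
        r->uptime_ms, r->task_hwm, r->task, r->build_hash);

    if (n < 0 || (size_t)n >= out_len) {
        return -1;   /* encoding error or truncated output */
    }
    return n;
}
```

Checking the return value matters here: a truncated JSON body would be rejected by the server, and clearing the crash record after such a failed send would lose the report.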


What Happens on the Cloud

The crash lands in Supabase PostgreSQL with all the raw registers. A background task then:

  1. Checks if an ELF file exists for this build hash
  2. If yes: runs pyelftools DWARF lookup on the PC address to get function name, file, and line number
  3. Builds a context-rich prompt and calls Claude Sonnet
  4. Stores the AI diagnosis back to the crash record

The AI prompt includes the decoded fault registers in human-readable form rather than raw hex (CFSR/HFSR bits on Cortex-M, EXCCAUSE on Xtensa), the breadcrumb trail, the resolved function name if available, and the task stack high watermark. That last one is surprisingly useful: a high watermark near zero means the task had almost no stack left, which points to stack overflow and explains a lot of seemingly random crashes.
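For the Xtensa side, "human-readable" can be as simple as a lookup from EXCCAUSE to its name. The sketch below covers a handful of common cause codes from the Xtensa architecture and adds a hypothetical threshold check for the stack watermark. It is written in C for consistency with the rest of this post; the cloud service's actual decoder is not shown here, and both function names are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Maps a few common Xtensa EXCCAUSE codes to their names. */
const char *fs_xtensa_cause_name(uint32_t exccause)
{
    switch (exccause) {
    case 0:  return "IllegalInstruction";
    case 2:  return "InstructionFetchError";
    case 3:  return "LoadStoreError";
    case 6:  return "IntegerDivideByZero";
    case 9:  return "LoadStoreAlignment";
    case 20: return "InstFetchProhibited";
    case 28: return "LoadProhibited";
    case 29: return "StoreProhibited";
    default: return "Unknown";
    }
}

/* Hypothetical heuristic: a watermark below the threshold suggests
 * the task came close to (or went past) the end of its stack. */
bool fs_stack_likely_exhausted(uint32_t high_water_mark, uint32_t threshold)
{
    return high_water_mark < threshold;
}
```

LoadProhibited (28) with a small EXCVADDR, as in the null-pointer case, is the pattern most people will recognize from ESP-IDF's "Guru Meditation" output.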

Claude's response looks like:

Root cause: Null pointer dereference in mqtt_publish_task at mqtt_client.c:247

The EXCVADDR register (0x00000004) indicates a load from address 0x4, which is a null pointer plus 4-byte offset — consistent with dereferencing a struct pointer where the struct's second field is being accessed. The breadcrumb trail shows the device had just completed an OTA download (ota_dl_begin at +3,847ms) before crashing in mqtt_publish_task. This suggests the OTA process may have corrupted or freed a message queue handle that the MQTT task continued to use. Check that OTA completion properly re-initializes the MQTT client handle before the publisher task resumes.

That's a real diagnosis, not a template. It uses the actual register values and breadcrumb sequence.


ARM Cortex-M: The Same Interface, Different Internals

The same fs_init() / fs_log_event() API works on STM32 because the HAL layer abstracts the differences:

|                    | ESP32 (Xtensa)                 | STM32L476 (Cortex-M4)           |
|--------------------|--------------------------------|---------------------------------|
| Fault hook         | __wrap_esp_panic_handler       | HardFault_Handler weak override |
| Registers          | EXCCAUSE, EXCVADDR, A0-A15     | CFSR, HFSR, MMFAR, BFAR, R0-R12 |
| Persistent storage | RTC_NOINIT_ATTR (~8KB RTC RAM) | SRAM2 top 2KB at 0x10007800     |
| Transport          | HTTPS direct (WiFi)            | UART → gateway.py → cloud       |
| API key storage    | NVS flash                      | Flash page 255                  |
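On the Cortex-M side, the CFSR listed in the Registers row packs the MemManage, BusFault, and UsageFault status bits into one 32-bit word. A minimal sketch of turning it into flag names could look like this; the bit positions follow the ARMv7-M architecture, while fs_cfsr_decode and the table name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t    mask;
    const char *name;
} fs_cfsr_bit_t;

/* CFSR flag bits: MemManage (0-7), BusFault (8-15), UsageFault (16-31). */
static const fs_cfsr_bit_t k_cfsr_bits[] = {
    { 1u << 0,  "IACCVIOL"    }, { 1u << 1,  "DACCVIOL"  },
    { 1u << 3,  "MUNSTKERR"   }, { 1u << 4,  "MSTKERR"   },
    { 1u << 5,  "MLSPERR"     }, { 1u << 7,  "MMARVALID" },
    { 1u << 8,  "IBUSERR"     }, { 1u << 9,  "PRECISERR" },
    { 1u << 10, "IMPRECISERR" }, { 1u << 11, "UNSTKERR"  },
    { 1u << 12, "STKERR"      }, { 1u << 13, "LSPERR"    },
    { 1u << 15, "BFARVALID"   }, { 1u << 16, "UNDEFINSTR"},
    { 1u << 17, "INVSTATE"    }, { 1u << 18, "INVPC"     },
    { 1u << 19, "NOCP"        }, { 1u << 24, "UNALIGNED" },
    { 1u << 25, "DIVBYZERO"   },
};

/* Writes the names of the set flags into names[] (lowest bit first),
 * up to max entries, and returns how many were written. */
size_t fs_cfsr_decode(uint32_t cfsr, const char **names, size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof(k_cfsr_bits) / sizeof(k_cfsr_bits[0]); i++) {
        if ((cfsr & k_cfsr_bits[i].mask) && n < max) {
            names[n++] = k_cfsr_bits[i].name;
        }
    }
    return n;
}
```

For example, a CFSR of 0x00008200 decodes to PRECISERR and BFARVALID: a precise bus fault whose faulting address can be read from BFAR.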

The SRAM2 placement deserves a call-out: C startup code zeros the entire SRAM2 region on reset. You can't just put your persistent data anywhere in SRAM2 — you have to reserve a region in your linker script that the startup code explicitly skips. Getting this wrong means your crash data silently vanishes on every reset. This took a full debugging session on real hardware to catch.


Open Source

The SDK is Apache 2.0 licensed. It's designed to drop into an existing ESP-IDF or STM32 HAL project with minimal changes to CMakeLists.txt (ESP32) or Core/Src (STM32).

GitHub: firmware-sentry-sdk

// ESP32 — three lines to integrate
#include "firmware_sentry.h"

void app_main(void) {
    nvs_flash_init();
    // ... wifi connect ...
    fs_init();  // Sends any pending crash, registers fault hooks

    // Scatter breadcrumbs throughout your code
    fs_log_event("boot_complete");

    // Your application starts here
}

The cloud dashboard is at firmware-sentry.vercel.app — create a free account, add a device, drop the SDK in.


If you've ever shipped a device that crashed in the field and had no idea why, I'd love to hear about it in the comments. What was your worst debugging war story?




Follow for the next post: how symbol resolution works with DWARF and pyelftools, and why amalgamated builds break it.
