<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SiliconLogiX</title>
    <description>The latest articles on DEV Community by SiliconLogiX (siliconlogix).</description>
    <link>https://dev.to/siliconlogix</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F13299%2F6a7097dd-4801-4ce6-9d89-f1dd6fe72af7.png</url>
      <title>DEV Community: SiliconLogiX</title>
      <link>https://dev.to/siliconlogix</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/siliconlogix"/>
    <language>en</language>
    <item>
      <title>Firmware Black Box: diagnosing embedded resets in the field</title>
      <dc:creator>Marco</dc:creator>
      <pubDate>Tue, 30 Jun 2026 09:26:29 +0000</pubDate>
      <link>https://dev.to/siliconlogix/firmware-black-box-diagnosing-embedded-resets-in-the-field-15ml</link>
      <guid>https://dev.to/siliconlogix/firmware-black-box-diagnosing-embedded-resets-in-the-field-15ml</guid>
      <description>&lt;p&gt;A device that resets in the field is not always the hardest problem.&lt;/p&gt;

&lt;p&gt;The harder problem is a device that resets, comes back online, and leaves no evidence about what happened before the reboot.&lt;/p&gt;

&lt;p&gt;That is where a firmware black box becomes useful.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is the DEV.to edition of a Silicon LogiX technical article. The canonical English source is linked at the end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;What a firmware black box is&lt;/h2&gt;

&lt;p&gt;A firmware black box is a small diagnostic subsystem inside the firmware. Its job is to preserve enough information to support post-mortem analysis after a reset, watchdog event, HardFault, panic or unexpected reboot.&lt;/p&gt;

&lt;p&gt;It does not need to record everything. It needs to record the data that helps answer the first diagnostic questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;why did the device reset?&lt;/li&gt;
&lt;li&gt;how long had it been running?&lt;/li&gt;
&lt;li&gt;which firmware build was installed?&lt;/li&gt;
&lt;li&gt;what state was the application in?&lt;/li&gt;
&lt;li&gt;which task was active?&lt;/li&gt;
&lt;li&gt;did the watchdog fire?&lt;/li&gt;
&lt;li&gt;did memory, stack or heap margins collapse?&lt;/li&gt;
&lt;li&gt;did the network, modem, BLE, Wi-Fi or OTA flow fail just before the reboot?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that data, every field reset deletes most of the evidence.&lt;/p&gt;

&lt;h2&gt;Why sporadic resets are expensive&lt;/h2&gt;

&lt;p&gt;Rare embedded bugs are often more expensive than obvious failures.&lt;/p&gt;

&lt;p&gt;A crash that happens every time in the same function can usually be analyzed with a debugger, logs and a repeatable test. A reset that appears once every ten days on a customer device is different.&lt;/p&gt;

&lt;p&gt;The cause may depend on a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;temperature&lt;/li&gt;
&lt;li&gt;unstable power&lt;/li&gt;
&lt;li&gt;brown-out&lt;/li&gt;
&lt;li&gt;cable length&lt;/li&gt;
&lt;li&gt;enclosure heating&lt;/li&gt;
&lt;li&gt;network drops&lt;/li&gt;
&lt;li&gt;modem state&lt;/li&gt;
&lt;li&gt;memory fragmentation&lt;/li&gt;
&lt;li&gt;stack exhaustion&lt;/li&gt;
&lt;li&gt;long uptime&lt;/li&gt;
&lt;li&gt;race conditions&lt;/li&gt;
&lt;li&gt;a peripheral that stops responding&lt;/li&gt;
&lt;li&gt;an OTA edge case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the lab, the product may look clean. In the field, the environment changes. The customer report often becomes: "it rebooted", "it stopped communicating", or "we had to power-cycle it".&lt;/p&gt;

&lt;p&gt;That is not enough for firmware diagnosis.&lt;/p&gt;

&lt;h2&gt;What to capture&lt;/h2&gt;

&lt;p&gt;A good first version does not need to be large. Start with a compact structure that survives the next boot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reset reason&lt;/li&gt;
&lt;li&gt;uptime before reset&lt;/li&gt;
&lt;li&gt;firmware version and build ID&lt;/li&gt;
&lt;li&gt;hardware variant&lt;/li&gt;
&lt;li&gt;application state&lt;/li&gt;
&lt;li&gt;last meaningful error&lt;/li&gt;
&lt;li&gt;watchdog counters&lt;/li&gt;
&lt;li&gt;recent application events&lt;/li&gt;
&lt;li&gt;active RTOS task&lt;/li&gt;
&lt;li&gt;stack high-water marks&lt;/li&gt;
&lt;li&gt;minimum heap&lt;/li&gt;
&lt;li&gt;OTA state&lt;/li&gt;
&lt;li&gt;network state&lt;/li&gt;
&lt;li&gt;fault registers or core dump when available&lt;/li&gt;
&lt;/ul&gt;
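
&lt;p&gt;As a rough illustration of how compact such a structure can be, here is a host-side model in Python. It is only a sketch: real firmware would normally define the record in C, and the field names, widths, the &lt;code&gt;MAGIC&lt;/code&gt; value and the CRC32 trailer are assumptions made for this example, not a fixed format. The point is that the boot code can tell a valid report from uninitialized memory.&lt;/p&gt;

```python
import zlib
from dataclasses import dataclass

MAGIC = 0xB1ACB0C5  # hypothetical marker meaning "a report was written"

# Hypothetical field widths in bytes, in serialization order
# (all values unsigned, little-endian).
FIELDS = (
    ("magic", 4),
    ("reset_reason", 1),
    ("app_state", 1),
    ("last_error", 2),
    ("uptime_s", 4),
    ("build_id", 4),
    ("min_free_heap", 4),
    ("watchdog_count", 2),
)
RECORD_SIZE = sum(width for _, width in FIELDS) + 4  # plus CRC32 trailer


@dataclass
class BlackBoxRecord:
    reset_reason: int
    app_state: int
    last_error: int
    uptime_s: int
    build_id: int
    min_free_heap: int
    watchdog_count: int
    magic: int = MAGIC

    def pack(self) -> bytes:
        """Serialize the fields and append a CRC32 of the body."""
        body = b"".join(
            getattr(self, name).to_bytes(width, "little")
            for name, width in FIELDS
        )
        return body + zlib.crc32(body).to_bytes(4, "little")


def unpack(raw: bytes):
    """Return a BlackBoxRecord, or None if `raw` holds no valid report."""
    if len(raw) != RECORD_SIZE:
        return None
    body, trailer = raw[:-4], raw[-4:]
    if zlib.crc32(body) != int.from_bytes(trailer, "little"):
        return None
    values, offset = {}, 0
    for name, width in FIELDS:
        values[name] = int.from_bytes(body[offset:offset + width], "little")
        offset += width
    if values["magic"] != MAGIC:
        return None
    return BlackBoxRecord(**values)
```

&lt;p&gt;With these assumed widths the whole record is 26 bytes, small enough for retention RAM. At boot, a &lt;code&gt;None&lt;/code&gt; result simply means there is no previous report to upload.&lt;/p&gt;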

&lt;p&gt;For FreeRTOS systems, &lt;code&gt;uxTaskGetStackHighWaterMark&lt;/code&gt; reports the smallest amount of free stack a task has had since it started, which is extremely useful when stack margins are suspected.&lt;/p&gt;

&lt;p&gt;For Cortex-M systems, the fault status registers (such as CFSR and HFSR), the stacked program counter and link register, and the stack pointer can turn a blind HardFault into a useful post-mortem report.&lt;/p&gt;

&lt;p&gt;For ESP32 products, ESP-IDF already provides useful mechanisms such as a panic handler and core dumps. The best design often combines those built-in tools with application-level event history and an upload after reboot.&lt;/p&gt;

&lt;h2&gt;Watchdog is not enough&lt;/h2&gt;

&lt;p&gt;The watchdog is essential. It can recover service when firmware stops responding.&lt;/p&gt;

&lt;p&gt;But if the watchdog resets the system and nothing is stored, the device is operational again while the root cause has disappeared.&lt;/p&gt;

&lt;p&gt;A good watchdog restarts the device.&lt;/p&gt;

&lt;p&gt;A good firmware black box explains why the restart was necessary.&lt;/p&gt;

&lt;p&gt;Useful questions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which task failed to report health?&lt;/li&gt;
&lt;li&gt;was the device reconnecting to the cloud?&lt;/li&gt;
&lt;li&gt;was the modem stuck?&lt;/li&gt;
&lt;li&gt;was heap decreasing over time?&lt;/li&gt;
&lt;li&gt;was an OTA update in progress?&lt;/li&gt;
&lt;li&gt;were there repeated DNS, TLS, MQTT or BLE errors?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the difference between recovery and diagnosis.&lt;/p&gt;
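
&lt;p&gt;One way to answer "which task failed to report health?" is to let every task check in and to feed the watchdog only when all of them have done so. The Python model below is a sketch of that idea, not a drop-in implementation: &lt;code&gt;TaskHealthMonitor&lt;/code&gt;, &lt;code&gt;feed_watchdog&lt;/code&gt; and &lt;code&gt;record_event&lt;/code&gt; are hypothetical names, and on a real target the same logic would run in a supervisor task written in C.&lt;/p&gt;

```python
class TaskHealthMonitor:
    """Feeds the watchdog only when every registered task has checked in."""

    def __init__(self, task_names):
        self._alive = {name: False for name in task_names}
        self.missed = {name: 0 for name in task_names}

    def check_in(self, name):
        # Called by each task from its main loop.
        self._alive[name] = True

    def supervise(self, feed_watchdog, record_event):
        """Run once per supervision period; returns the stalled task names."""
        stalled = [name for name, alive in self._alive.items() if not alive]
        for name in stalled:
            self.missed[name] += 1
            # Store the evidence BEFORE the hardware watchdog expires.
            record_event("task_stalled", name, self.missed[name])
        if not stalled:
            feed_watchdog()
        # Start a fresh observation window.
        for name in self._alive:
            self._alive[name] = False
        return stalled
```

&lt;p&gt;The design choice that matters is the ordering: the stalled task is written to the event history while the system is still running, so the watchdog reset that follows no longer erases the reason for it.&lt;/p&gt;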

&lt;h2&gt;Logging and diagnostics are different&lt;/h2&gt;

&lt;p&gt;Many firmware projects have logs. That does not mean they are diagnosable.&lt;/p&gt;

&lt;p&gt;UART logs are useful during development, but they often disappear in production. If nobody had a terminal connected when the customer device reset, those logs are gone.&lt;/p&gt;

&lt;p&gt;Diagnostics should be designed for field conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no debugger attached&lt;/li&gt;
&lt;li&gt;no serial cable connected&lt;/li&gt;
&lt;li&gt;no engineer present at the crash&lt;/li&gt;
&lt;li&gt;limited memory&lt;/li&gt;
&lt;li&gt;possible power loss&lt;/li&gt;
&lt;li&gt;possible network failure&lt;/li&gt;
&lt;li&gt;a support team that needs a readable report&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A compact persistent ring buffer with the latest important events may be more valuable than thousands of UART lines that nobody will ever see.&lt;/p&gt;
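
&lt;p&gt;A persistent ring buffer can be modelled in a few lines. The sketch below uses a Python &lt;code&gt;bytearray&lt;/code&gt; as a stand-in for retention RAM or a flash region. The 12-byte slot layout and the &lt;code&gt;EventRing&lt;/code&gt; name are assumptions for illustration, and it assumes empty storage reads as zero (erased flash usually reads as 0xFF, so a real implementation would adapt the "empty" test).&lt;/p&gt;

```python
SLOT_SIZE = 12  # seq(4) + uptime_s(4) + code(2) + arg(2), little-endian


class EventRing:
    def __init__(self, storage: bytearray):
        # `storage` stands in for retention RAM or a flash region.
        self.storage = storage
        self.capacity = len(storage) // SLOT_SIZE
        # Recover the write position from the stored sequence numbers,
        # so the ring keeps working after a reboot.
        self.next_seq = max((e[0] for e in self.events()), default=0) + 1

    def _slot(self, index):
        start = index * SLOT_SIZE
        return slice(start, start + SLOT_SIZE)

    def append(self, uptime_s, code, arg=0):
        record = (
            self.next_seq.to_bytes(4, "little")
            + uptime_s.to_bytes(4, "little")
            + code.to_bytes(2, "little")
            + arg.to_bytes(2, "little")
        )
        # The oldest slot is overwritten once the ring is full.
        self.storage[self._slot((self.next_seq - 1) % self.capacity)] = record
        self.next_seq += 1

    def events(self):
        """Return (seq, uptime_s, code, arg) tuples, oldest first."""
        found = []
        for index in range(self.capacity):
            raw = bytes(self.storage[self._slot(index)])
            seq = int.from_bytes(raw[0:4], "little")
            if seq == 0:  # slot never written
                continue
            found.append((
                seq,
                int.from_bytes(raw[4:8], "little"),
                int.from_bytes(raw[8:10], "little"),
                int.from_bytes(raw[10:12], "little"),
            ))
        return sorted(found)
```

&lt;p&gt;Because the write position is rebuilt from the sequence numbers, no separate head pointer has to survive the reset. Numeric event codes instead of strings keep each entry small and avoid formatting work in the failure path.&lt;/p&gt;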

&lt;h2&gt;Storage options&lt;/h2&gt;

&lt;p&gt;The right storage depends on the product:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retention RAM for data that must be written quickly and survive a software reset&lt;/li&gt;
&lt;li&gt;internal flash for small critical reports&lt;/li&gt;
&lt;li&gt;NVS or a dedicated partition on ESP32&lt;/li&gt;
&lt;li&gt;EEPROM or FRAM for frequent diagnostic counters&lt;/li&gt;
&lt;li&gt;local filesystem on embedded Linux gateways&lt;/li&gt;
&lt;li&gt;cloud upload after reboot for connected devices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Do not make the cloud the only source of truth. The network may be part of the problem.&lt;/p&gt;

&lt;p&gt;Also be careful with flash wear, power loss during writes, atomic updates and sensitive data. Diagnostic reports should not contain passwords, private keys, tokens or unnecessary personal data.&lt;/p&gt;
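
&lt;p&gt;For power loss during writes, a common pattern is to keep two slots and to write the "commit" marker last, so an interrupted write never destroys the previous report. The following Python model is a simplified sketch of that pattern. &lt;code&gt;TwoSlotStore&lt;/code&gt;, the marker values and the &lt;code&gt;fail_after_step&lt;/code&gt; parameter (used only to simulate a power cut) are hypothetical, and real flash adds constraints such as erase granularity that are not modelled here.&lt;/p&gt;

```python
COMMIT = 0xA5  # hypothetical "slot is complete" marker
ERASED = 0x00  # hypothetical "slot is not valid" marker


class TwoSlotStore:
    """Keeps the previous report intact while the next one is written."""

    def __init__(self):
        # Each slot is [marker, sequence, payload]; the two lists model
        # two separate storage regions.
        self.slots = [[ERASED, 0, b""], [ERASED, 0, b""]]

    def _newest(self):
        valid = [s for s in self.slots if s[0] == COMMIT]
        return max(valid, key=lambda s: s[1], default=None)

    def load(self):
        """Return the newest committed payload, or None if there is none."""
        newest = self._newest()
        return None if newest is None else newest[2]

    def save(self, payload, fail_after_step=None):
        """Write in three steps; `fail_after_step` simulates power loss."""
        newest = self._newest()
        # Always overwrite the slot that does NOT hold the newest report.
        target = self.slots[0] if newest is not self.slots[0] else self.slots[1]
        next_seq = 1 if newest is None else newest[1] + 1

        target[0] = ERASED  # step 1: invalidate the target slot
        if fail_after_step == 1:
            return
        target[1], target[2] = next_seq, payload  # step 2: write the data
        if fail_after_step == 2:
            return
        target[0] = COMMIT  # step 3: commit marker goes last
```

&lt;p&gt;Alternating between two slots also spreads writes over two regions, although it is not a substitute for real wear levelling when reports are written frequently.&lt;/p&gt;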

&lt;h2&gt;A practical checklist&lt;/h2&gt;

&lt;p&gt;Before shipping a connected embedded product, ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Is reset reason collected at boot?&lt;/li&gt;
&lt;li&gt;[ ] Are watchdog, brown-out, software reset and manual reset distinguished?&lt;/li&gt;
&lt;li&gt;[ ] Is firmware version tied to a build ID or commit?&lt;/li&gt;
&lt;li&gt;[ ] Is hardware variant recorded?&lt;/li&gt;
&lt;li&gt;[ ] Is there a useful HardFault or panic strategy?&lt;/li&gt;
&lt;li&gt;[ ] Are RTOS task health and stack margins observable?&lt;/li&gt;
&lt;li&gt;[ ] Is minimum heap recorded?&lt;/li&gt;
&lt;li&gt;[ ] Is there a persistent event ring buffer?&lt;/li&gt;
&lt;li&gt;[ ] Are network and OTA events included?&lt;/li&gt;
&lt;li&gt;[ ] Can support export a report without JTAG?&lt;/li&gt;
&lt;li&gt;[ ] Are sensitive data and credentials excluded?&lt;/li&gt;
&lt;/ul&gt;
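
&lt;p&gt;For the reset-reason items, the boot code usually reads a set of hardware flags and has to pick one cause. The Python sketch below shows one possible priority order. The bit layout in &lt;code&gt;ResetFlag&lt;/code&gt; is invented for this example: the real flag positions and names come from the reference manual of the specific MCU. On some parts more than one flag is set at the same time, which is why the order of the checks matters.&lt;/p&gt;

```python
from enum import IntFlag


class ResetFlag(IntFlag):
    # Hypothetical bit layout; real positions are MCU-specific.
    POWER_ON = 0x01
    BROWN_OUT = 0x02
    WATCHDOG = 0x04
    SOFTWARE = 0x08
    PIN = 0x10


# Most specific cause first. The pin-reset flag is checked last because
# it may be set together with another, more informative cause.
PRIORITY = (
    (ResetFlag.BROWN_OUT, "brown-out"),
    (ResetFlag.WATCHDOG, "watchdog"),
    (ResetFlag.SOFTWARE, "software reset"),
    (ResetFlag.POWER_ON, "power-on"),
    (ResetFlag.PIN, "manual (reset pin)"),
)


def classify_reset(raw_flags: int) -> str:
    """Map a raw reset-flag value to a single human-readable cause."""
    flags = ResetFlag(raw_flags)
    for flag, label in PRIORITY:
        if flag in flags:
            return label
    return "unknown"
```

&lt;p&gt;Whatever the real register looks like, the flags are typically read once early at boot, stored in the report and then cleared, so that the next reset does not inherit stale causes.&lt;/p&gt;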

&lt;h2&gt;Typical field scenario&lt;/h2&gt;

&lt;p&gt;Imagine an STM32 or ESP32 controller. It passes lab tests, talks to sensors, sends data to the cloud and supports OTA. The watchdog is enabled, so the team feels safe.&lt;/p&gt;

&lt;p&gt;After release, some customers report random reboots.&lt;/p&gt;

&lt;p&gt;Without a black box, the team guesses: maybe the modem, maybe a task stack, maybe power, maybe memory, maybe I2C, maybe Wi-Fi.&lt;/p&gt;

&lt;p&gt;With a black box, the next episode can produce a report:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;watchdog reset&lt;/li&gt;
&lt;li&gt;uptime: 91 hours&lt;/li&gt;
&lt;li&gt;application state: MQTT reconnect&lt;/li&gt;
&lt;li&gt;network task stack almost exhausted&lt;/li&gt;
&lt;li&gt;several DNS timeouts in the recent event buffer&lt;/li&gt;
&lt;li&gt;firmware build ID identified&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not the final fix, but it is a direction. In embedded debugging, a direction can save days or weeks.&lt;/p&gt;

&lt;h2&gt;Final takeaway&lt;/h2&gt;

&lt;p&gt;A firmware black box is not an extra feature for perfectionists. It is part of building maintainable embedded products.&lt;/p&gt;

&lt;p&gt;A device in the field should not only recover. It should explain what happened.&lt;/p&gt;

&lt;p&gt;When diagnostics are designed with reset reason, watchdog analysis, fault context, RTOS metrics, persistent events, OTA state and safe export, sporadic resets stop being mysterious and become measurable.&lt;/p&gt;

&lt;p&gt;And a measurable problem is much closer to a fix.&lt;/p&gt;




&lt;p&gt;Canonical source: &lt;a href="https://www.siliconlogix.it/en/article/firmware-black-box-how-to-find-out-why-an-embedded-device-resets-in-the-field" rel="noopener noreferrer"&gt;Firmware Black Box: how to find out why an embedded device resets in the field&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you build embedded, IoT or firmware products and want a second pair of eyes on diagnostics, watchdog strategy, OTA behavior or field failures, Silicon LogiX can help turn hard-to-reproduce issues into measurable engineering work.&lt;/p&gt;

</description>
      <category>firmware</category>
      <category>embedded</category>
      <category>iot</category>
      <category>debugging</category>
    </item>
  </channel>
</rss>
